Journal of Applied Psychology 


John G. Darley, Editor
University of Minnesota


Lorraine Bouthilet, Managing Editor





Table of Contents 
A Test of the Effects of Pregnenolone Methyl Ether on Subjective Feelings of B-29 Crews After 
a Twelve-Hour Mission: S. B. Sells, J. R. Barry, D. K. Trites, and H. I. Chinn


The Effect of Scale Interval Length and Pointer Clearance on Speed and Accuracy of Inter- 
polation: A. V. Churchill 


Transfer of Training Between Quickened and Unquickened Tracking Systems: J. G. Holland 
and J. B. Henson 


Theory and Analysis of Component Errors in Aided Pursuit Tracking in Relation to Target 
Speed and Aided-Tracking Time Constant: J. R. Simon and K. U. Smith 


Ability Grouping in Army Basic Combat Training: D. C. Findlay, S. M. Matyas, and H. Rogge 


Differentiation of Individuals in Terms of Their Predictability: E. E. Ghiselli 

Optimum Letter Size for a Given Display Area: C. S. Bridgman and E. A. Wade 

Empirical Assessment of Handrail Diameters: N. B. Hall, Jr. and E. M. Bennett 

Personal History Data as a Predictor of Success in Service Station Management: R. S. Soar


Interest Scores in Identifying the Potential Trade School Dropout: C. O. Samuelson and D. T. 
Pearson, Sr 


The Naval Knowledge Test: A. S. Glickman 
Development of a Structured Disguised Personality Test: B. M. Bass 
A Scale Measuring Attitudes Toward Working for the Government: B. P. Aalto 


Evaluation of a Supervisory Training Program with How Supervise?: R. P. Barthol and M. 
Zeigler 


An Item Analysis of How Supervise? Using Both Internal and External Criteria: R. L. Decker


Preference Measurement by the Methods of Successive Intervals and Monetary Estimates: 
P. H. Benson and J. H. Platten, Jr 


The Relationship Between Chi Square and Size of Sample: the General Case: H. D. Kimmel





This is the last issue of Volume 40. 
Volume Title Page and Contents appear herein. 





American Psychological Association 


Volume 40, Number 6 December, 1956 





Consulting Editors 


Harold E. Burtt, Ohio State University 

Alphonse Chapanis, Johns Hopkins Univer- 
sity 

Clifford E. Jurgensen, Minneapolis Gas 
Company 

Laurence S. McGaughran, University of 
Houston 

Quinn McNemar, Stanford University 


Alexander Mintz, City College of New York 
Harold F. Rothe, Fairbanks, Morse and 
Company 
Julian B. Rotter, Ohio State University 
Thomas A. Ryan, Cornell University 
Donald E. Super, Columbia University 
Miles A. Tinker, University of Minnesota 
Alfred C. Welch, University of New Mexico 





This journal gives primary consideration to origi- 
nal investigations in any field of applied psychol- 
ogy except clinical and consulting psychology, al- 
though a descriptive or theoretical article may be 
accepted if it represents a special contribution in 
an applied field. Quantitative investigations of in- 
terest or value to psychologists working in the fol- 
lowing broad fields will be considered: vocational 
and educational prognosis, diagnosis, and guidance 
at the secondary and college level; personnel re- 
search in business, industry, and government; bio- 
mechanics; industrial working conditions; research 
on opinion and morale factors; job analysis and 
classification research; market and advertising re- 
search. 


Because of the large number of manuscripts sub- 
mitted, authors should adhere to the rule of 


“brevity consistent with clarity.” The typical 
manuscript should run to approximately 4,000 
words. There is a lag of approximately twelve 
months between receipt and publication of an 
article. Authors may request advanced publica- 
tion if they are prepared to pay the cost of print- 
ing the necessary extra pages. 


Manuscripts should be addressed to the Editor, 
John G. Darley, 408 Johnston Hall, University of 
Minnesota, Minneapolis 14, Minnesota. All manu- 
scripts should be submitted in duplicate. Original 
figures are prepared for publication; duplicate fig- 
ures may be photographic or pencil-drawn copies. 


Manuscripts must conform to the style require- 
ments described in the “Publication Manual of the 
American Psychological Association,” Psychol. Bull., 
1952, 49, No. 4, Part 2. 





Journal of Applied Psychology 


Published bimonthly by the 


American Psychological Association 
Prince and Lemon Sts., Lancaster, Pa. 
and 1333 Sixteenth Street N.W. 
Washington 6, D. C. 


$8.00 per volume 


$1.50 per issue 


Subscriptions, orders, and business communications should be addressed to the American Psychological Association, 
1333 Sixteenth St. N.W., Washington 6, D. C. Address changes must reach the subscription office by the 10th of 
the month to take effect the following month. Undelivered copies resulting from address changes will not be replaced;
subscribers should notify the post office that they will guarantee second-class forwarding postage. Other claims for
undelivered copies must be made within four months of publication.
Entered as second-class matter, August 19, 1943, at the post office at Lancaster, Pa., under the act of March 3, 1879. 
Acceptance for mailing at the special rate of postage provided for in paragraph (d-2), Section 34.40, P. L. & R.
of 1948, authorized October 10, 1947.


© 1956 by the American Psychological Association, Inc. 





Journal of Applied Psychology 








Vol. 40, No. 6


DECEMBER, 1956 








A Test of the Effects of Pregnenolone Methyl Ether on Subjective 
Feelings of B-29 Crews After a Twelve-Hour Mission ¹


Saul B. Sells, John R. Barry, David K. Trites, and Herman I. Chinn ²


Air University, USAF School of Aviation Medicine, Randolph Field 


Pregnenolone methyl ether (PME) has 
been reported by Huffman and his associates 
(1, 5) to produce favorable effects on psychi- 
atric patients in mitigating subjective reac- 
tions of fatigue, irritability, anxiety, and fear. 
Campbell et al. state that (with doses varying
from 125 to 250 mg. given from one to
four times daily),


The patients with fair consistency reported an al- 
most immediate feeling of relaxation following the 
administration of a therapeutic dose. They reported 
that they could sleep with greater ease and that they 
felt more rested on the following day. There did 
not appear to be any significant over-accentuations 
of euphoria such as does occur with the use of 
amphetamine. A most interesting observation was 
that cases combining severe depression symptoms 
with insomnia showed a considerable relief from 
both, and we have encountered no other single 
medication which has produced this result (1). 


More recently Sleeper (5) reported on a series of 150 private
patients, over a period of almost two years, who were given daily
divided dosage levels of 150 to 500 mgm. in a digestible oil
solution of 25 mgm./cc. strength. He, too, found that the only
frequent alteration in patients taking this steroid was a decrease
in irritability and anxiety. This change seemed to occur within a
limit characteristic of each patient, and increasing the dosage
level produced no great change after the limit of improvement was
reached. The two patient groups for which Sleeper found most
favorable results were involutional patients with anxiety and
depression, and psychoneurotic patients who had become stable with
passage of time, but remained uncomfortable. No significant side
effects, except an occasional mild rash or mild nausea, possibly
due to the oil carrier, were found.

1 The cooperation and assistance of many persons made this research
possible. Appreciation is expressed to Colonel Colin E. Anderson,
Commander, 3510th Flying Training Group, for his permission to
carry out the study with student crews and for his wholehearted
encouragement. Captain Emil Chapla and Major Ben Weeks of the
3510th Flying Training Group assisted substantially by scheduling
the briefings and test sessions and advising on administrative
arrangements. Major Joseph Quashnock, Department of Flight
Medicine, School of Aviation Medicine, USAF, provided medical
consultation and attended all testing sessions. The following
personnel of the School of Aviation Medicine assisted on technical
phases of the study: Major T. C. Kahn, Captain M. R. Seaquist,
Dr. Albert Kubala, T/Sgt. Thomas Putnam, S/Sgt. Charles F. Eckel,
A/1C Robert Laves, A/1C George L. Sheldon, A/1C Gary Walkup,
A/2C Wayne Fowler, and A/2C Raphael Dondero.

2 The first three authors are in the Department of Clinical
Psychology, the fourth in the Department of Biochemistry and
Pharmacology.

Operational flying missions of long dura- 
tion, with concomitant exposure to hazard, 
physical discomfort, and deprivation, fre- 
quently involve subjective emotional reactions 
which may cause or be accompanied by re- 
duced capacity for accurate efficient perform- 
ance. If efficiency could be increased by 
medication which altered neurophysiological 
balances and thus mitigated undesirable sub- 
jective reactions, it would be important to the 
Air Force. 

The experiment described below was un- 
dertaken to determine whether similar effects 
might be produced in normal bomber crew 
personnel, under realistic operational flying 
conditions. The tests were arranged in con- 
junction with a long overwater training mis- 
sion involving between 15 and 18 hours of 
continuous activity, 12 hours of which were 
in flight. The drug was administered after
the completion of the initial psychological
testing about an hour after landing. The ef- 
fects of the drug were assessed later after the 
crews unloaded and checked their aircraft
and completed a postflight inspection and 
critique. 


Method 
Plan of the Experiment 


Eight student B-29 crews and their instructor 
crews volunteered to serve as subjects in this ex- 
periment. Each crew, including instructors, was nor- 
mally composed of 15 individuals. The effects of 
the drug were tested in relation to the anxiety, fa- 
tigue, and irritability incident to a long, overwater 
training mission involving navigation, bombing, and 
gunnery problems, which is considered by the train- 
ing group to be extremely tiring and stressful. 

Since the side effects of the drug were unknown, 
it was decided to conduct the first test on the 
ground, at the end of the mission, when the effects 
of the long flight and loss of sleep would be height- 
ened and while several terminal activities integral to 
the completion of the mission were yet to be per- 
formed. This procedure provided a good situation 
for the experiment, involving realism in the use of 
operational personnel, and required only 90 minutes 
of additional time for drug administration and psy- 
chological testing. 

The entire group of 120 crew members and in- 
structors was sorted into two subgroups according 
to a plan which provided for equal representation in 
each subgroup of each crew and each crew position. 
Subgroup E, designated as the experimental group, 
was administered the drug. Subgroup C, designated 
as the control group, was administered a placebo. 
The subjects were told that they might receive dif- 
ferent dosages, but no subject knew whether he was 
being given the drug or the placebo. Two hours 
after take-off, crew 2 was forced to abort the mis- 
sion and returned to base. An unexpected variation 
in the size of the crews (13 to 16 men instead of 
15), together with the loss of this crew and the fail- 
ure of three instructors to appear for testing, re- 
duced the final sample for the experiment to 50 
control and 51 experimental subjects. Subsequently, 
it was found that all of the test results were invalid 
for one control subject who had not followed the 
instructions, and that for one test two experimental 
subjects had invalid scores. 

The eight crews took off at 15-minute intervals, 
beginning at 2010 hours on April 26, 1954. Prior to 
take-off each crew had been at the flight line for 
three hours for briefing, loading, and preflight in- 
spection. The seven crews completing the mission 
landed the next morning between 0730 and 0930 
hours. 

As soon as each crew landed, it taxied the aircraft 
to a parking area and reported directly to an assigned
briefing room where a 45-minute battery of
psychological tests was administered. Upon comple- 
tion of the testing, each crew member was given 
either the drug or the placebo. Smoking was per- 
mitted, but none of the subjects was permitted to 
eat until after the experiment was concluded. 

After administration of the drug, the crews re- 
turned to their aircraft for unloading and a post- 
flight inspection. This was expected to require at 
least an hour, but because of a heavy rain several 
of the crews were unable to display their gear for 
inspection; the unloading procedure was, therefore, 
quite rapid. The crews then returned to the briefing 
rooms for a postflight critique with their instructors. 
Upon completion of the critiques, the psychological 
tests were repeated, and the crews were dismissed. 

The following morning at 0930 the entire group 
was assembled in a large briefing room and given a 
follow-up questionnaire. 

The psychological tests administered after the drug 
had been taken were expected to reflect the feelings 
of the subjects at that time. Large mean differences 
in the predicted direction between the test scores of 
the experimental and control groups would have sug- 
gested that the drug was effective. The same tests 
were administered before the drug was taken so that 
each subject’s pre-drug feelings might be evaluated. 
This also permitted statistical control of the initial 
differences between the experimental and control 
groups which might have masked subsequent differences
attributable to the drug.³
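
The statistical control described above is conventionally carried out as a covariance adjustment of the post-drug means. The following is a minimal sketch, assuming the usual pooled within-group regression of post-drug on pre-drug scores; the function names and the toy scores are hypothetical and are not data from this study, and the exact computational steps in McNemar's text may differ in detail.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group slope of post scores on pre scores.

    groups: list of (pre_scores, post_scores) pairs, one pair per group.
    """
    sxy = 0.0
    sxx = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx


def adjusted_post_means(groups):
    """Post means adjusted to the grand mean of the pre scores."""
    b = pooled_within_slope(groups)
    grand_pre = mean([x for pre, _ in groups for x in pre])
    return [mean(post) - b * (mean(pre) - grand_pre) for pre, post in groups]


# Hypothetical toy scores (not from the study): the two groups differ
# by 2 points before treatment and by 2 points after it.
experimental = ([10.0, 12.0, 14.0, 16.0], [15.0, 17.0, 19.0, 21.0])
control = ([12.0, 14.0, 16.0, 18.0], [17.0, 19.0, 21.0, 23.0])

adj_exp, adj_ctl = adjusted_post_means([experimental, control])
```

In this toy case the raw post-test difference is fully accounted for by the pre-test difference, so the two adjusted means coincide.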


Drug Dosage and Administration 


Huffman and his associates administered pregneno- 
lone methyl ether (PME) to patients in doses of 
160 to 320 mg. from one to four times per day. 
The average daily dosage was 500 mg. To facilitate 
absorption, Huffman prepared the PME as a solu- 
tion of 25 mg./ml. in coconut oil plus a commercial 
emulsifier. This was administered in water. 

For the purpose of Air Force application, it was 
considered important to administer the drug in a 
single dose, of high potency, which would have rapid 
effects. Accordingly, in the present experiment, 800 
mg., prepared as described above, were administered. 

After swallowing the oil, the subjects were given 
a glass of orangeade and a stick of chewing gum. 
The orange juice and chewing gum effectively offset 
the oily taste, and virtually all the subjects ac- 
cepted the dose without comment. 


Psychological Tests 


Since the drug was reported to influence subjective feelings of
irritability, anxiety, and fatigue, the psychological test battery
assembled for this experiment was designed to measure manifest
affective reactions of these kinds. The question of possible side
effects, such as the impairment of perceptual skills, judgment,
reasoning, memory, and intellectual and psychomotor skills, was
deferred pending confirmation of the earlier clinical observations
of Campbell et al. (1). The experiment was planned with the
assumption that the affective effects should be evaluated first,
since this could be done more rapidly and efficiently, and that
possible deleterious side effects should be studied only if
positive findings were obtained on the affective tests. Affective
relief or improvement can be accompanied by intellectual or motor
impairment (as in the case of alcohol), but it is unlikely for
intellectual or motor improvement to occur without some affective
feedback to the subjects.

3 On the assumption of equal Ns, variances, and covariances for the
group, this design is superior to the use of post-drug test scores
only when the correlation between pre- and post-drug test scores is
above .50. This was the case in the present experiment.
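
The .50 threshold cited in the footnote on the pre- and post-drug design can be illustrated with a short calculation. This is a sketch under one reading of that footnote, namely that the comparison is between a change score (post minus pre) and the post score alone, with equal variances; the function names are hypothetical.

```python
def change_score_variance(sigma2: float, r: float) -> float:
    """Variance of (post - pre) when both scores have variance sigma2
    and correlate r: sigma2 + sigma2 - 2 * r * sigma2."""
    return 2.0 * sigma2 * (1.0 - r)


def change_score_is_more_precise(r: float) -> bool:
    """True when the change score varies less than the post score alone."""
    sigma2 = 1.0  # any positive value gives the same answer
    return change_score_variance(sigma2, r) < sigma2
```

Under these assumptions the change score has the smaller variance only when the pre-post correlation exceeds .50, which agrees with the condition stated in the footnote.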

The psychological tests are described in detail else- 
where (4). They were designed to obtain a meas- 
ure of the subjects’ subjective feelings “at this time” 
and to be capable of measuring changes over brief 
periods. The battery of six tests includes a perceptual
test, scored for threatening objects perceived;
an adjective check list containing 35 pairs of self-description
adjectives; an annoyance test; a questionnaire
composed of two anxiety scales; a controlled
word-association test; and an attitude scale. Twelve
separate scores were computed from these six tests, 
and the following effects of PME on them were ex- 
pected: 


                                                          Expected effect
Test scores                                               of PME

 1. Total objects seen (test 1)                           Increase
 2. Percent of threatening objects seen (test 1)          Decrease
 3. Depressed affect (adjective) score (test 2)           Decrease
 4. Annoyance score (test 3)                              Decrease
 5. Annoyance add score (test 3)                          Decrease
 6. Number of annoyance add items (test 3)                Decrease
 7. Taylor scale of manifest anxiety (test 4)             Decrease
 8. MMPI “Lie” scale (test 4)                             No change
 9. SAM manifest anxiety score (test 4)                   Decrease
10. Cornell Word Form (test 5)                            Decrease
11. Tendency to agree (test 6)                            Decrease
12. Unwillingness to admit common frailties (test 6)      Increase


The follow-up questionnaire, administered the fol- 
lowing day, covered the following items: (a) how 
long the subject remained awake after leaving the 
hangar; (b) whether he found it more or less diffi- 
cult, or no difference noted, to fall asleep; (c) what 
changes in feelings were noticed at any time up to
the meeting where the questionnaire was given; (d)
a check list of adjectives, such as “tired,” “relaxed,”
“depressed,” “irritable,” etc.; (e) an open-end ques-
tion requesting additional comments.


Table 1

Pre-Drug and Post-Drug Mean Test Scores of Control Group and Experimental Group and the Mean
Post-Drug Scores of the Two Groups Adjusted for Differences in Pre-Drug Means ᵃ

                                          Experimental             Control                  Adjusted ᵇ
Test
No.  Score Title                        N    Pre       Post     N    Pre           Post     Exp.     Control

1    Total Objects Seen                 51   14.02**   19.76    49   14.47**       21.92    19.99    21.67
1    % Threatening Objects              51   .23       .25      49   .22ᶜ          .21      .25      [illegible]
2    Depressed Affect Score             49   7.47**    4.29     49   [illegible]   4.45     4.34ᶜ    4.40
3    Annoyance Score                    51   94.53**   89.86    49   92.67**       88.78    89.12ᶜ   89.56
3    Annoyance Add Score                51   15.74**   11.57    49   19.67**       12.49    13.16    10.84
3    No. of Annoying Add Items          51   4.57ᶜ     3.88     49   3.82**        3.88     4.31     3.43
4    Taylor Anxiety Score               51   8.27**    7.10     49   8.47*         7.61     7.19ᶜ    7.52
4    MMPI “Lie” Score                   51   4.18ᵈ     4.16     49   3.92ᵈ         3.98     4.06     4.08
4    SAM Manifest Anxiety Score         51   7.84**    5.92     49   8.10**        6.39     6.00ᶜ    6.31
5    Cornell Word Form                  51   4.37*     3.73     49   4.31*         3.71     3.37ᶜ    3.41
6    Tendency to Agree                  51   165.86    167.51   49   165.45        167.27   167.51   167.42
6    Willingness to Admit
       Common Frailties                 51   61.31ᶜ    62.37    49   61.24ᶜ        62.43    62.37    62.46

ᵃ Means were adjusted by method suggested by McNemar (3, p. 328).
ᵇ None of the differences between adjusted means is significant at less than the .05 level.
ᶜ Differences between the means are in the predicted direction.
ᵈ Predicted not to shift.
* Difference between means is in predicted direction and significant at less than the .05 level.
** Difference between means is in predicted direction and significant at less than the .01 level.







Table 2

Comparison of Experimental and Control Groups on Follow-Up Questionnaire

                                              Control       Experimental    Signifi-
Variable                                      (43 cases)    (48 cases)      cance

Mean number of minutes awake after
  leaving hangar                              189.8         [illegible]     *
More difficulty falling asleep                0             [illegible]
No difference in falling asleep               29            [illegible]
Less difficulty falling asleep                14            [illegible]
Number of favorable feeling items circled     5.92          [illegible]
Number of favorable additional comments       3.08          [illegible]

* T test not significant.
** χ² test not significant.


Results 


The results for the twelve test scores are 
summarized in Table 1. The mean raw 
scores on the Minnesota Multiphasic Person- 
ality Inventory “Lie” scale are 4.18 and 4.16, 
respectively, for the pre- and post-drug ad- 
ministrations for the experimental group and 
3.92 and 3.98, respectively, for the control 
group. These mean scores cluster closely 
around the 50th percentile of adult males and 
indicate appropriate test-taking attitudes on 
the part of the subjects. The ranges of “Lie” 
scores for both groups are within normal 
limits. 

A comparison of the pre- and post-drug mean test scores for the
experimental group indicates that nine of the 11 mean scores ⁴
shift in the predicted direction after administration of the drug.
Of these, six means shifted to an extent significant at the .01
confidence level and one to an extent significant at the .05 level.
It may be inferred that the shifts in mean scores are consistent
with the hypothesis that the putative effects of PME are operative;
however, the same general effects are found for the control group,
with 10 of the 11 mean score shifts from pre-drug to post-drug
administration in the predicted direction. Of these, six are
significant at the .01 level of confidence and two at the .05
level. Furthermore, none of the differences between the
experimental and control groups, in adjusted ⁵ post-drug mean
scores, approximates the .05 confidence level.

4 Excluding the MMPI “Lie” scale, predicted not to shift.

5 Mean scores were adjusted by method suggested
by McNemar (3, p. 328).


These results indicate that changes in feel- 
ings and affect are experienced by the crews 
between the time immediately after landing 
and one to two hours later. These changes 
have been noted frequently by crew mem- 
bers and may represent release from tension, 
relaxation, and change of task from flying to 
whatever they may be doing on the ground. 
Since no increment of improvement was found 
in the experimental group over the control 
group, it is necessary to reject the effects of 
the drug as a factor contributing to the 
changes observed. 

In Table 2 the results from the follow-up 
questionnaire are summarized. No significant
differences were found, one day later, in re- 
ports made by experimental and control group 
crew members on the number of minutes they 
remained awake after leaving the hangar, 
difficulty in falling asleep, or general affect. 
While there may not have been sufficient time 
allowed in this study for the absorption of 
the drug, this was not indicated by the re- 
sults of the follow-up. 

Because of the variation in time between 
drug administration and post-drug testing, as 
explained earlier, the pre- and post-drug
scores were compared for the two crews with 
the shortest elapsed time (less than 63 min- 
utes) and for the two crews with the longest 
(greater than 94 minutes). Although the 
trends observed in these data were consistent 
with those obtained for the total sample, 
there were no systematic differences which 
could be attributed to the differences in 
elapsed time. 

An additional interesting finding is that in- 
creased intestinal motility on the day of the
experiment was reported by 17 subjects on
the follow-up questionnaire. Most of the 
subjects attributed this motility to the oily 
emulsion in which the drug was administered. 
Of this group, nine were from the experimen- 
tal group and eight were from the control 
group. 


Conclusion 


Evidence obtained from this experiment 
does not support the claims made for PME 
by Huffman and his associates. Admittedly, 
there are significant differences between the 
conditions of the present experiment and the 
clinical conditions in which Huffman’s ob- 
servations were made. The subjects of this 
investigation were healthy, robust, active fly- 
ers, observed after a fatiguing and stressful 
training mission, while Huffman’s subjects 
were seriously disturbed mental patients. 
Concentrated dosage was used in the present 
study, while the dosage in Huffman’s study 
was diffused over a longer period of time. 
Whether or not an extended period of medi- 
cation would have been more effective is a 
matter of speculation. If such a regimen is 
required, however, it would seriously limit 
usefulness of the drug among crew personnel. 
Finally, the present evaluation of the drug is 
based on rigorous objective testing, quantita- 
tively evaluated, while Huffman’s impressions 
are subjective and clinical. Although the
conditions of the present experiment did pro-
duce severe subjective feelings of irritability, 
depression, and anxiety among the crews, and 
the changes observed through the test bat- 
tery did reflect improvement during the hours 
after landing, the subjects who received PME 
did not improve to a greater extent than did 
those given the placebo. This study does not 
support the use of PME for the alleviation of 
depression, irritability, and anxiety feelings 
of crew members. 


Received December 19, 1955. 


References 


1. Campbell, C. H., Huffman, M. N., & Sleeper,
H. G. The effects of pregnenolone methyl
ether in psychiatric patients. Paper presented
before Mid-Continent Psychiatric Association,
Kansas City, Mo., Sept. 1953.

2. Goldfain, E., & Huffman, M. N. A study of the
action of pregnenolone methyl ether in patients
with rheumatoid arthritis. Acta med.
Scandinav., 1954, 147, Parts 5-6. Pp. 455-458.

3. McNemar, Q. Psychological statistics. New York:
Wiley, 1949.

4. Sells, S. B., Barry, J. R., Trites, D. K., & Chinn,
H. I. A test of the effects of pregnenolone
methyl ether on subjective feelings of B-29
crews after a twelve-hour mission. USAF
Sch. Aviat. Med. Rep., 1955 (Rep. No. 55-11).

5. Sleeper, H. G. Experimental use of pregnenolone
methyl ether in treating psychiatric symptoms.
Dis. Nerv. System, 1955, 16, 93-94.





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


The Effect of Scale Interval Length and Pointer Clearance on 
Speed and Accuracy of Interpolation ¹,²


A. V. Churchill 


Defence Research Medical Laboratories, Toronto, Canada 


A number of studies have been reported on 
the effect of scale-interval length on inter- 
polation accuracy. Reports by Grether and 
Williams (2, 3), Kappauf and Smith (4, 5), 
and Leyzorek (6) indicate that the accuracy 
of interpolation between scale marks is de- 
pendent upon the separation of the marks. 
These studies agree that the best interval for 
interpolation lies between 0.5 and 1.0 inch, 
at normal reading distances. The displays 
studied were composed of more than one scale 
interval and thus the subjects’ task involved 
both scale reading and interpolation. 

No reports on the effect of pointer clearance
(the distance between the pointer tip
and the scale mark, in the plane of the scale)
on interpolation have been uncovered. On
the accuracy of reading straight scales to the 
nearest scale mark, Vernon (7) has reported 
that “up to a clearance of 0.7 inch errors do 
not increase appreciably.” For curved scales 
Woodson (8, pp. 1, 8) has recommended a 
maximum clearance of jg inch between 
pointer and scale mark. 

The experiment reported here was designed 
to reveal the effect of interval length and 
pointer clearance on the speed and accuracy 
of interpolating to tenths of a scale interval. 
The data also disclosed systematic changes in 
the direction of errors and relationships be- 
tween accuracy of initial and subsequent re- 
sponses and between orders of presentation of 
scale intervals. 

Method 
Apparatus 

Single-scale intervals were used in order to eliminate the effects
of scale reading as such. The intervals were 0.25, 0.50, 0.75, 1.0,
1.5, and 2.0 inches. Each interval consisted of a horizontal
reference line 0.03 inch thick, with scale marks 0.03 inch thick and
0.20 inch long at the extremities. They were drawn in black on
individual white cards. The pointer was 0.03 inch wide at the tip,
and was adjustable to give clearances of 0.0, 0.125, 0.25, 0.50,
1.0, and 2.0 inches between the pointer tip and the scale reference
line.

1 Defence Research Medical Laboratories Report
No. 164-4, Project No. D77-94-20-27, H.R. No. 125.

2 A set of 18 data and analysis tables has been deposited
with the American Documentation Institute.
Order Document No. 5041 from ADI Auxiliary Publications
Project, Photoduplication Service, Library
of Congress, Washington 25, D. C., remitting in advance
$1.75 for microfilm or $2.50 for photocopies.
Make checks payable to Chief, Photoduplication
Service, Library of Congress.

The S viewed the display through an aperture. A 
shutter was placed between the aperture and the dis- 
play. A chin rest served to maintain a constant 28- 
inch viewing distance, and controlled eye level so 
that approximately one inch of the pointer was 
visible for all presentations. Displays were at right 
angles to the line of sight. The scales which the 
experimenter used for setting the pointer were ex- 
panded to a seven-to-one ratio and marked off in 
tenths. 

In the first half of the experiment the shutter was 
opened by the experimenter. The S ended the ex- 
posure by pressing a microswitch. Exposure time was 
measured by an electric timer which was operated 
by the shutter. In the second half of the experi- 
ment the shutter was operated by an interval timer, 
giving a 0.3-second exposure. 

Throughout both parts of the experiment the illu- 
mination was 180 footcandles at the display, as 
measured by the Macbeth illuminometer. 


Procedure 


Ten laboratory employees served as Ss. Each S
was presented with the six scale intervals in random 
order. The six pointer clearances were presented in 
random order for each interval. The nine inter- 
polated pointer positions were presented twice under 
each of the 36 conditions. Each series of 18 settings 
was randomized. 
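
The nested randomization described above can be sketched as follows. This is an illustration only: it assumes the nine interpolated positions are the tenths 1 through 9 of the interval, and the function name and the seeded random generator are hypothetical conveniences, not part of the original procedure.

```python
import random

INTERVALS = [0.25, 0.50, 0.75, 1.0, 1.5, 2.0]     # scale interval lengths, inches
CLEARANCES = [0.0, 0.125, 0.25, 0.50, 1.0, 2.0]   # pointer clearances, inches
POSITIONS = list(range(1, 10))                    # assumed: tenths 1..9 of the interval
REPEATS = 2                                       # each position presented twice
N_SUBJECTS = 10


def schedule_for_subject(rng):
    """Return one subject's trials as (interval, clearance, position) tuples."""
    trials = []
    intervals = INTERVALS[:]
    rng.shuffle(intervals)                 # intervals in random order
    for interval in intervals:
        clearances = CLEARANCES[:]
        rng.shuffle(clearances)            # clearances random within each interval
        for clearance in clearances:
            settings = POSITIONS * REPEATS
            rng.shuffle(settings)          # each series of 18 settings randomized
            trials.extend((interval, clearance, pos) for pos in settings)
    return trials


schedule = schedule_for_subject(random.Random(0))
readings_per_subject = len(schedule)
readings_per_condition = N_SUBJECTS * len(POSITIONS) * REPEATS
```

With 36 conditions of 18 settings each, one subject makes 648 readings, and the ten subjects together give 180 readings per condition.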

The S was instructed in the task and shown a 
sample display. Readings were made under “speed 
and accuracy” instructions, and each exposure was 
preceded by a “Ready” signal. 


Results 


Since the results show identical trends whether considering error
frequency or magnitude, the tabulations and analyses presented here
are based on error frequencies. The tabulations based on error
magnitude are not included in this report.³

Table 1 shows the total error frequencies for 10 Ss under
S-controlled exposure time. One hundred eighty readings were made
under each of the 36 conditions.

3 Tables 13-20; see footnote 2.


Table 1

The Effect of Scale Interval Length and Pointer Clearance on the
Frequency of Interpolation Errors

(Subject-controlled exposure time)

Columns in the original: scale interval lengths of 0.25, 0.50, 0.75,
1.0, 1.5, and 2.0 inches, followed by the row total. Only part of
each row is legible in this copy.

Pointer clearance    Legible cell entries,
(inches)             in printed order          Total

0.0                  [illegible]               223
0.125                28   36   24              210
0.25                 53   41   28              253
0.50                 64   50   27              274
1.0                  74   41   26              300
2.0                  79   60   49              365

Total                552  337  271  179        [illegible]

The data from Table 1 were transformed
to satisfy the assumptions of analysis of
variance (10), $y = 2\sin^{-1}\sqrt{\%\ \text{error}}$ (9),
and an analysis of variance was performed
on the transformed data. The results of the
analysis are shown in Table 2.
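
The transformation named above appears to be the angular (arcsine) transformation commonly applied to proportions before an analysis of variance. A minimal sketch, assuming the error rate is expressed as a proportion between 0 and 1 and the angle is taken in radians; the function name is hypothetical, and the paper does not state its units.

```python
import math


def arcsine_transform(proportion: float) -> float:
    """Angular transformation y = 2 * asin(sqrt(p)), in radians."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return 2.0 * math.asin(math.sqrt(proportion))


# Example: 45 errors in the 180 readings of one condition (hypothetical count).
y = arcsine_transform(45 / 180)
```

The transformation stretches proportions near 0 and 1, which tends to make the variance of the transformed values less dependent on the error rate itself.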

From the analysis it will be seen that the 
decrease in errors as pointer clearance is re- 
duced is significant at the .01 level. The 
deviations from regression are not significant. 


Table 2

Analysis of Variance of Error Data Presented in Table 1

(Data from Table 1 transformed: $y = 2\sin^{-1}\sqrt{\%\ \text{error}}$)

                                                    Mean
Source                               df             Square

Among pointer clearances
  due to regression                  [illegible]    .4261**
  deviations from regression         [illegible]    .0060
Among interval lengths
  due to regression                  [illegible]    3.1203**
  deviations from regression         [illegible]    .0393
Error                                [illegible]    .0090

Total                                [illegible]

** Significant at the .01 level.





Fig. 1. The effect of scale-interval length (S.I.L.) 
and pointer clearance (P.C.) on interpolation time 
and error. [Plot for Parts I and II: error frequency and 
average time per group of 18 readings (secs) against 
scale interval length (ins; .25, .50, .75, 1.0, 1.5, 2.0) and 
pointer clearance (ins; 2.0, 1.0, .50, .25, .125, 0).] 


The decrease in errors as the scale-interval 
length is increased is significant at the .01 
level. The deviations from regression indi- 
cate that heterogeneity is still present. 

The mean times for each condition under 
S-controlled exposure time were tabulated 
and analyzed.⁴ The results of the analysis 
of variance show the same relationships as 
presented in Table 2, significant at the .01 
level. 

The error frequencies for controlled (0.3 
sec.) exposure time were tabulated and 
analyzed.⁵ Analysis of variance shows the same 
results as the analysis of the S-controlled 
exposure time data, significant at the .01 level. 

For the pointer clearance time and error 
data the line fitted is $y = b_0 + b_1 x$, where 
$y = 2\sin^{-1}\sqrt{\%\ \text{error}/6}$ and $x$ = (pointer 
clearance × 8). The curvilinear relationship 
between time and error and interval length 
is represented by the function $y = b_0 + b_1 \log x$, 
where $y = 2\sin^{-1}\sqrt{\%\ \text{error}/6}$ and $x$ = 
(scale interval length × 4). The heterogeneity 
indicated by the significance of the deviations 
from regression is not constant. These 
relationships may be seen more clearly in the 
distributions of time and error which are 
presented in Fig. 1. 


4 Tables 3-4; see footnote 2. 
5 Tables 5-6; see footnote 2. 
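
A least-squares fit of the two forms described above might be sketched as follows. Several points are assumptions rather than statements of the report: the row totals of Table 1 are converted to proportions of the 1,080 readings in each row, the pointer clearance levels are taken from the axis labels of Fig. 1, the logarithm is taken to base 10, and the helper names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_log_line(xs, ys):
    """Least squares for y = b0 + b1 * log10(x) (base 10 is an ASSUMPTION)."""
    return fit_line([math.log10(x) for x in xs], ys)


def transform(p):
    """Angular transformation y = 2 * arcsin(sqrt(p)) of a proportion p."""
    return 2.0 * math.asin(math.sqrt(p))


# Row totals of Table 1 (errors summed over the six interval lengths).
row_totals = [223, 210, 253, 274, 300, 365]
# ASSUMPTION: clearance levels (inches) as labelled on the Fig. 1 axis.
clearances = [0.0, 0.125, 0.25, 0.5, 1.0, 2.0]
READINGS_PER_ROW = 6 * 180      # six conditions of 180 readings each

xs = [c * 8 for c in clearances]                        # x = clearance x 8
ys = [transform(t / READINGS_PER_ROW) for t in row_totals]
b0, b1 = fit_line(xs, ys)       # a positive b1 means more errors at larger clearance
```

The coefficients obtained this way are illustrative only and need not agree with those underlying Table 2.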

The curves presented in Fig. 1 demon- 
strate the effect of scale-interval length and 
pointer clearance on reading time and error. 
As will be seen from the graph, time and 
error decrease as the scale-interval length is 
increased, and as pointer clearance is de- 
creased. There is a general trend toward a 
change in the curves at the 2-inch interval 
length and the zero pointer clearance. 


Discussion 


During the administration of the experi- 
ment and the tabulation of the data a num- 
ber of interesting relationships were observed. 

1. It was noted that there tended to be 
more errors in the remaining responses to a 
group of settings if the initial response was 
incorrect. Errors were tabulated⁶ for the remaining 
responses to groups of settings in 
which the initial response was correct or in- 
correct. The results showed that signifi- 
cantly more errors were made when the initial 
response was incorrect. This relationship was 
not characteristic of individual subjects, ini- 
tial pointer positions, scale intervals, or 
pointer clearances. 

2. The experiment was designed with the 
order of presentation of scale intervals ran- 
domized. While tabulating the data it was 
observed that the short scale intervals ap- 
peared to have an adverse effect on longer 
intervals when they preceded the longer intervals. 
Errors were tabulated for scale intervals 
for the various presentation positions,⁷ 
and the tabulations revealed a tendency to- 
ward more errors on the longer scale inter- 
vals when they followed the shorter intervals 
than when they preceded the shorter intervals. 

3. The tabulation of the data showed an 
apparent relationship between the interval 
length and the direction of errors—toward 


6 Tables 7-10; see footnote 2. 
7 Table 11; see footnote 2. 





A. V. Churchill 


the interval extremes and toward the interval 
mid-point (i.e., extremes, 1 and 9: mid-point, 
5). The data were tabulated in terms of the 
direction of error for interval lengths and 
pointer clearances. Ratios of errors toward 
the extremes to errors toward the mid-point 
were calculated. This ratio is large at the 
0.25-inch interval (a mean of 2.74 for the 
two parts of the experiment), diminishes as 
the interval is lengthened up to 1.0 inch (a 
mean of 1.22), then drops below one—i.e., 
the majority of errors tend toward the mid- 
point—giving a mean ratio of .39 at the 2.0- 
inch interval. 

When comparing direction of errors with 
pointer clearances it was found that the ratio 
is large at zero clearance (a mean of 3.93), 
diminishes as the clearance is increased up to 
1.0 inch (a mean of 1.08), then drops below 
one at the 2.0-inch clearance (a mean of .76). 
Carr and Garner (1) noted a similar change 
in the direction of errors when scale intervals 
ranging from 0.5 to 25 mm. were interpolated 
in one-hundredths. 


Summary 


1. Reading time and errors of interpolation 
decrease significantly as the scale-interval 
length is increased from 0.25 to 1.5 inches, 
with no improvement at the 2.0-inch interval 
length. 

2. Reading time and errors of interpola- 
tion decrease significantly as the pointer 
clearance is reduced from 2.0 to 0.125 inches, 
with no improvement at zero clearance. 

3. If the response to the first reading of a 
group is incorrect, there is a tendency to- 
ward more errors on the remaining readings 
in that group than there are when the initial 
response is correct. 

4. There is a tendency toward increased 
errors on a scale interval of a given length if 
it is preceded by a shorter scale interval. 

5. The majority of errors tend toward the 
interval extremes on the short scale intervals 
and pointer clearances, and toward the inter- 
val mid-point on the long scale intervals and 
pointer clearances. One inch appears to be 


8 Table 12; see footnote 2. 







the transition point for both scale-interval 
length and pointer clearance. 


Received March 15, 1956. 


References 


1. Carr, W. J., & Garner, W. R. The maximum 
precision of reading fine scales. J. Psychol., 
1952, 34, 85-94. 

2. Grether, W. F., & Williams, A. C., Jr. Speed 
and accuracy of dial reading as a function of 
dial diameter and spacing of scale divisions. 
USAF Air Materiel Command, Engng. Div., 
Aero Med. Lab Memo Rep., 1947, No. 
TSEAA-694-1E. 

3. Grether, W. F., & Williams, A. C., Jr. Psycho- 
logical factors in instrument reading. II. The 
accuracy of pointer position interpolation as 
a function of the distance between scale marks 
and illumination. J. appl. Psychol., 1949, 33, 
594-604. 

4. Kappauf, W. E., & Smith, W. M. Design of in- 
strument dials for maximum legibility. II. A 
preliminary experiment on dial size and gradu- 
ation. Dayton, O.: USAF Air Materiel Com- 
mand, Wright-Patterson AFB, 1948. (AF 
Tech. Rep. No. 5914, Pt. 2.) 

5. Kappauf, W. E., & Smith, W. M. Design of in- 
strument dials for maximum legibility. IV. 
Dial graduation, scale range and dial size as 
factors affecting the speed and accuracy of 
scale reading. Dayton, O.: USAF Air Ma- 
teriel Command, Wright-Patterson AFB, 1950. 
(AF Tech. Rep. No. 5914, Pt. 4.) 

6. Leyzorek, M. Accuracy of visual interpolation 
between circular scale markers as a function 
of the separation between markers. J. exp. 
Psychol., 1949, 39, 270-279. 

7. Vernon, M. D. Scale and dial reading. Med. 
Res. Council, Unit appl. Psychol. (Cam- 
bridge) Rep., 1946, No. MRC/APUR 49. 

8. Woodson, W. E. Human engineering guide for 
equipment designers. San Diego, Calif.: 
U. S. Navy Electronics Laboratory, 1954. 

9. Hald, A. Statistical tables and formulas. New 
York: Wiley, 1952. 

10. Quenouille, M. H. Introductory statistics. Lon- 
don: Butterworth-Springer, 1950. 





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Transfer of Training Between Quickened and Unquickened 
Tracking Systems¹ 


James G. Holland and Jean B. Henson 


Naval Research Laboratory 


Tracking systems differ widely in the effect 
that movement of the control has upon the 
displayed tracking error. Some systems have 
a tight display-control relationship and pro- 
vide a positional displacement of the target 
proportional to the positional displacement of 
the stick. Many other conventional systems 
provide an acceleration of the displayed tar- 
get for a positional input at the control. 
Such systems have a loose display-control re- 
lationship and require the operator to antici- 
pate the effects of his response and to make 
a countermovement after a given response 
but before the displayed error reaches zero 
in order to avoid overshooting. Birmingham 
and Taylor (1), however, have demonstrated 
that when position and velocity information 
are added to the display in the proper propor- 
tions, the operator is no longer required to 
anticipate the results of his previous move- 
ments of the control. Instead, he has only 
to make his immediate movements propor- 
tional to the displayed error. Since the op- 
erator has instantaneous knowledge of re- 
sults, the system is said to be quickened as 
opposed to the unquickened system, in which 
knowledge of results is delayed. 

Before quickening is adopted in any practical 
situation, it is desirable to determine 
(a) if operators trained on an unquickened 
system will experience habit interference when 
forced to use a quickened system, (b) if their 
initial performance on the quickened system 
will suffer seriously when compared with their 
typical performance on the unquickened system 
with which they are proficient, and (c) 
if operators experienced with a quickened system 
will be handicapped when they have to 
switch to an unquickened system. 


1 The opinions or assertions contained herein are 
the private ones of the writers and are not to be 
construed as official or reflecting the views of the 
Navy Department or the naval services at large. 

The material discussed in this paper has previously 
been presented as NRL Technical Report No. 
4703. 

Much research (3) has shown a relation- 
ship between similarity of stimuli and the 
transfer of training. When a new response is 
learned to an old or similar stimulus, nega- 
tive transfer is expected due to the inter- 
ference provided by the arousal of the old, 
and now erroneous, response. This condition 
might appear to prevail between a quickened 
and an unquickened task. To the naive op- 
erator the displays might seem to be very 
much alike, since in both cases only the track- 
ing error is presented. When an operator has 
been trained on a quickened system, his re- 
sponse should come to be directly propor- 
tional to the displayed error; but if switched 
to the unquickened system, proportioning the 
responses to the displayed error would be 
detrimental. Thus, negative transfer might 
be expected when an operator trained on a 
quickened system is transferred to an un- 
quickened system, or vice versa. 


Method 


Apparatus. The tracking task was of the com- 
pensatory type. The S was required to keep a dot 
on a cathode-ray tube centered on a hairline by 
manipulating a joy stick. The dot was free to move 
only in the horizontal plane. The dot was forced 
off the hairline by a sine wave of three cycles per 
min. generated by an analog computer. Despite the 
regularity of the course, the displayed error ap- 
peared erratic to S since it was the difference be- 
tween the course and the control output. To pre- 
vent S from anticipating the initial direction of 
movement, the polarity of the course was reversed 
randomly between trials. 

The control was a spring-restrained joy stick. 
Movement of the stick deflected the plate of a 
vacuum-tube strain gauge providing a voltage 
proportional to the stick deflection. In the unquickened 
system this voltage was fed through two integrators 
of an analog computer and then combined 
with the course and fed into the display (Fig. 1). 
Thus, a deflection of the stick resulted in an 
acceleration of the dot on the scope. A stick deflection 
of 1 cm. resulted in an acceleration of 16 cm./sec.² 
in the dot. The quickened system had two 
feedforward loops: one around both integrators 
added position information, and the other around 
the second integrator added velocity information. 
A stick deflection of 1 cm. resulted in a displayed 
displacement of 2 cm., a velocity of 8 cm./sec., and 
an acceleration of 16 cm./sec.². Thus, the relation 
of position, velocity, and acceleration was the 1:4:8 
relation which previous research (4) has shown to 
be optimal. 


Fig. 1. Simplified block diagram of tracking apparatus. 
Switch position Q provided the quickened 
system and switch position U provided the unquickened 
system. 
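
The difference between the two systems can be illustrated numerically. The sketch below is not the analog computer circuit of the study; it is a simple Euler integration, with a hypothetical function name and step size, that applies the gains stated in the text (2 cm. of position, 8 cm./sec. of velocity, and 16 cm./sec.² of acceleration per centimeter of stick deflection) and ignores the forcing course.

```python
def displayed_displacement(stick, duration, quickened, dt=0.001,
                           k_pos=2.0, k_vel=8.0, k_acc=16.0):
    """Displacement (cm) of the dot due to the control alone after `duration` sec.

    stick: function mapping time (sec) to stick deflection (cm).
    """
    steps = int(round(duration / dt))
    first = 0.0     # output of the first integrator (integral of deflection)
    second = 0.0    # output of the second integrator (double integral)
    for i in range(steps):
        u = stick(i * dt)
        second += first * dt
        first += u * dt
    out = k_acc * second                      # acceleration path (both systems)
    if quickened:
        out += k_pos * stick(duration)        # position added around both integrators
        out += k_vel * first                  # velocity added around the second integrator
    return out


def step(t):
    """A constant 1-cm stick deflection."""
    return 1.0


unquickened_1s = displayed_displacement(step, 1.0, quickened=False)
quickened_1s = displayed_displacement(step, 1.0, quickened=True)
quickened_0s = displayed_displacement(step, 0.0, quickened=True)
```

For a constant 1-cm. deflection the unquickened display moves as 8t² cm., whereas the quickened display moves as 2 + 8t + 8t² cm., so the quickened dot responds at once to the stick.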

Scoring was accomplished by using a device which 
averaged the displayed error on each trial without 
regard to sign. 

Procedure. Twenty-four naval enlisted men, hav- 
ing combined GCT and ARI scores of 120, served 
as Ss. They were divided into four groups of six 
Ss each. Two groups were trained on the unquick- 
ened system—one of these received 140 40-sec. learn- 




ing trials, and the other received 260 40-sec. learning 
trials. The remaining two groups were trained on 
the quickened system—one received 140 trials and 
the other received 260 trials. Ten trials were given 
in each experimental session. There were approxi- 
mately 40 sec. between trials within each session and 
a minimum of 20 min. between successive sessions. 

After an S had completed his training trials he 
was switched to the other system, i.e., an S trained 
with the unquickened system was tested on the 
quickened system and one trained with the quick- 
ened system was tested on the unquickened system. 
Transfer of training was evaluated by comparing 
the performance during the initial test session with 
the first training session of the 12 Ss who originally 
were trained on the system in question. 

After all Ss had been given 80 40-sec. trials on 
the test condition, they were switched back to the 
system with which they were originally trained. 
This was done to ascertain the extent of interfer- 
ence provided by the intervening practice with a 
different system. 


Results 


The results are summarized in Fig. 2. 
Plotted on the abscissa are successive sessions, 
each session containing ten 40-sec. 
trials. On the ordinate are plotted average 
integrated error scores in arbitrary units. 
Each point represents a mean of ten trials for 
six Ss. The broken-line curves represent the 
performance of the two groups trained initially 
on the quickened system, while the solid-line 
curves represent the performance of the 
two groups trained initially on the unquickened 
system. The black circles represent 
groups with the greater amount of training 
(i.e., 26 sessions); and the white circles represent 
groups with the lesser amount of training 
(i.e., 14 sessions). The vertical lines 
mark the points at which the Ss switched systems 
(session 27 for the high-trained groups 
and session 15 for the low-trained groups). 
The Ss trained on the unquickened system at 
these points began using the quickened system, 
while those trained on the quickened 
system began using the unquickened system. 


Fig. 2. Average integrated error scores, in arbitrary units, for the four experimental conditions as a function 
of successive sessions. Each point represents the mean of ten trials for each of six Ss. 
[Plot: average integrated error against sessions; training conditions: unquickened high, unquickened low, 
quickened high, quickened low; transfer trials marked for the low- and high-training groups.] 


The extent of the transfer can be seen 
by comparing the initial transfer session 
with the first session during the training pe- 
riod for the two groups beginning their train- 
ing with the system in question. When ses- 
sion 15 for the group with the low degree of 
training on the unquickened system is com- 
pared with the first session of the two groups 
trained on the quickened system, it is seen 
that positive transfer occurred. The mean 
integrated error score for the first training 
session of all Ss initially trained with the 
quickened system is 7.28, while the mean for 
the first transfer session of the group with 
the low degree of training on the unquick- 
ened system is 5.09. These points are sig- 
nificantly different at the .05 level, indicating 
that the transfer effect is different from zero. 
The transfer session is also significantly different 
from the mean integrated error score 
of 3.94 found during session 14 for the quickened 
groups. Thus, while significant positive 
transfer was obtained, the transfer was not 
complete. 

When the group receiving the greater 
amount of training on the unquickened condition 
was switched to the quickened condition, 
there was again an indication of positive 
transfer. An average integrated error score 
of 5.70 was obtained on the first transfer session 
as compared with 7.28 for Ss lacking the 
preceding experience with the unquickened 
display. However, these two points are not 
significantly different; therefore, it cannot be 
concluded that transfer was greater than zero. 
A comparison of the two transfer sessions is 
not significant either, so there is no basis for 
the conclusion that the differing amounts of 
training provide different degrees of transfer. 

In the case of switching from the quickened 
to the unquickened display, transfer is positive 
for both groups; and in both cases the 
difference is statistically significant at the .05 
level. Also, in both cases the transfer is not 
complete. The average integrated error score 
for naive Ss is 18.30 as compared with 10.87 
for those having first had 14 sessions with 
the quickened system and 11.82 for those 
having first had 26 sessions with the quickened 
system. The transfer points for these 
two groups did not differ significantly from 
each other, so again there is no basis for concluding 
that the different extents of training 
result in differing amounts of transfer. 

Table 1 presents the percentage of transfer 
scores. These scores are obtained by the 
following formula: 

$$\text{per cent transfer} = \frac{\text{scores for inexperienced Ss} - \text{scores for transferred Ss}}{\text{scores for inexperienced Ss} - \text{scores at asymptote of learning}} \times 100$$

Zero per cent transfer would mean the performance 
is the same on the transfer day as 
that obtained by inexperienced Ss. One hundred 
per cent transfer would mean that the 
score on the first transfer session is the same 
as the scores obtained at the asymptote of 
learning. For both directions of transfer there is 
somewhat less positive transfer for the higher amount of 
training but these differences are not significant. 


Table 1 

Per Cent Transfer Scores for the Four Experimental Conditions 

                  Transfer Scores (Per Cent) 
Degree of      Quickened to      Unquickened 
Training       Unquickened       to Quickened 

Low                 s                 64 
High                §                 46 


After eight sessions on the transfer condi- 
tion Ss were switched back to their original 
training condition to answer an exploratory 
question regarding interference introduced by 
the intervening experience (Fig. 2). For all 
groups the performance is essentially at the 
same level as immediately before the inter- 
vening experience with the second display. 
This suggests that there may be no difficulty 
in switching back and forth between quick- 
ened and unquickened systems, provided both 
have been learned to some degree of pro- 
ficiency. 

Discussion 


The results of this experiment would thus 
seem to answer the questions posed. An op- 
erator experienced with unquickened systems 
should not be penalized in learning to operate 
a quickened system, nor should an operator 
trained with a quickened system experience 
difficulty in learning to use an unquickened 
system. Instead, in either case, he might 
receive some benefit from his previous ex- 
perience. However, since transfer is not com- 
plete, some training would probably be neces- 
sary before the operator would reach the full 
potential of the system. In this regard, it 
may be of considerable practical importance 
that much less training should be required in 
the case of switching to the quickened task. 

Results similar to the present study were 
obtained by Lincoln (2), using rather dif- 
ferent tracking situations. He investigated 
transfer of training among three different pur- 
suit tracking systems. In one system a step 
displacement of the control provided a ve- 
locity of the displayed cursor (i.e., an un- 
quickened velocity control); in a second 
system a step displacement of the control 
provided both a position and a velocity of 
the displayed cursor (i.e., a quickened ve- 
locity control); and in the third system a 
step displacement of the control provided 
simply a position of the cursor (i.e., a posi- 
tion control having no counterpart in the 
present study). He found various degrees of 
positive transfer in switching either from the 
unquickened control to the quickened con- 




trol or from the quickened control to the un- 
quickened control. Thus, in his study, quick- 
ened and unquickened velocity controls for 
pursuit tracking provided results similar to 
those of the present study, which used quick- 
ened and unquickened acceleration controls 
in a compensatory tracking situation. 

Interestingly, Lincoln obtained negative 
transfer in cases of switching from either the 
quickened or unquickened controls to the po- 
sition control. It would be informative to 
investigate the nature of the transfer that 
would be obtained in switching from an ac- 
celeration control to a position control or 
from a control providing position, velocity, 
and acceleration to a position control. Un- 
fortunately, the present study provides no in- 
formation on this point. 

Earlier it was explained that negative trans- 
fer might be expected since the relationship 
between the displayed error and the appro- 
priate response appears to be so different for 
the two systems. However, positive rather 
than negative transfer was obtained. Since 
the previous literature has clearly established 
the relationship between transfer effect and 
both similarity of the stimuli and similarity 
of responses involved in the two tasks, the 
results of the present study raise some ques- 
tion concerning the psychological nature of 
the two tasks used here. There seem to be 
two possibilities. First, the stimuli used even 
by naive Ss may be very different for the two 
tasks. That is to say, the loose display-con- 
trol relationship in the unquickened system 
and the tight display-control relationship in 
the quickened system might be recognized 
even by naive Ss (or at least very early in 
training). If so, the dissimilarity of the 
stimuli might prevent negative transfer. Posi- 
tive transfer would then be explained as a re- 
sult of general familiarity with the tracking 
systems employed here. The second possi- 
bility is that certain principles of responding 
might be learned in one task and transferred 
to the other task. For example, S might learn 
to avoid some amplitudes of stick movement 
which should not be used for either task. 
Thus, it would not be a matter of transfer of 
the actual stimulus-response relationship but 
rather of eliminating certain components of 







the response pattern which would be errone- 
ous in either tracking task. 

The answer to this problem must await re- 
search which determines the psychological 
nature of tracking behavior. Such basic work 
should permit prediction of how the opera- 
tor’s performance would vary as a function 
of many other variables. Until such informa- 
tion is available the task similarity theory of 
transfer will probably be of little value in 
predicting transfer effects between different 
continuous tracking systems. 


Summary 


This study was designed to determine the 
direction and extent of transfer of training 
for Ss switched to a quickened tracking sys- 
tem after having been trained with an un- 
quickened system and for Ss switched to an 
unquickened tracking system after having 
been trained with a quickened system. 

Four groups of six Ss each were used. Two 
groups were trained on the unquickened sys- 
tem and two groups were trained on the 
quickened system. Training consisted of 140 
40-sec. trials for one of the groups trained on 
the unquickened system and for one of the 


groups trained on the quickened system. For 
the remaining two groups training consisted 
of 260 40-sec. trials, one on the unquickened 
system and the other on the quickened sys- 
tem. After training, each group was switched 


to the system for which it was naive. Trans- 
fer of training was evaluated by comparing 




the performance during this initial test ses- 
sion with the first training session of the two 
groups which originally were trained on the 
system in question. 

The results of the experiment suggest a 
number of conclusions. 

1. Positive transfer occurs in switching 
either from unquickened to quickened sys- 
tems or from quickened to unquickened sys- 
tems. 

2. Different amounts of training, within 
the range employed in the present study, 
provide no difference in the extent of transfer. 

3. Transfer of training between these two 
systems is not complete. Thus, some train- 
ing is necessary before the full potential of 
the new system is reached. 


Received March 15, 1956. 


References 


1. Birmingham, H. P., & Taylor, F. V. A design 
philosophy for man-machine control systems. 
Proc. Inst. Radio Engrs, 1954, 42, 1748-1758. 

2. Lincoln, R. S. Visual tracking: III. The instru- 
mental dimension of motion in relation to 
tracking accuracy. J. appl. Psychol., 1953, 
37, 489-493. 

3. Osgood, C. E. The similarity paradox in human 
learning: a resolution. Psychol. Rev., 1949, 
56, 132-143. 

4. Searle, L. V. Psychological studies of tracking 
behavior. VI. The intermittency hypothesis 
as a basis for predicting optimum aided- 
tracking time constants. U. S. Naval Res. 
Lab. Rep., 1951, No. 3872. 





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Theory and Analysis of Component Errors in Aided Pursuit Tracking 
in Relation to Target Speed and Aided-Tracking Time Constant¹ 


J. Richard Simon² and Karl U. Smith 


University of Wisconsin 


This study deals with an analysis of op- 
erator errors in aided pursuit tracking. The 
aim is to determine how variations in target 
speed and aided-tracking time constant affect 
the type of errors made. 

Aided tracking is partial automation of the 
steering function in tracking. Its object is to 
simplify the operator’s task and thus reduce 
error. The aid supplied is a rate of cursor 
movement which is automatically generated 
by a motor system as the operator adjusts 
his hand control to follow the target. The 
amount of aid in a system is expressed by 
the aided-tracking time constant. This con- 
stant is the ratio between the amount of 
direct displacement of the cursor and the 
change in velocity of the cursor per unit of 
control movement. 
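
The definition of the time constant can be made concrete with a brief sketch. It assumes the usual form of an aided-tracking control, in which cursor position is the sum of a term proportional to control position and a term proportional to the integral of control position; the function name, the sampling scheme, and the default gain are hypothetical and are not taken from the apparatus described here.

```python
def aided_cursor(control, dt, time_constant, rate_gain=1.0):
    """Cursor positions for control positions sampled every dt seconds.

    time_constant (sec) = position gain / rate gain, so that
    position gain = time_constant * rate_gain.
    """
    position_gain = time_constant * rate_gain
    integral = 0.0
    cursor = []
    for theta in control:
        # Direct displacement plus the automatically generated rate term.
        cursor.append(position_gain * theta + rate_gain * integral)
        integral += theta * dt
    return cursor


# A unit step of the handwheel held for one second, time constant 0.5 sec.
trace = aided_cursor([1.0] * 11, dt=0.1, time_constant=0.5)
```

In this form a unit control movement displaces the cursor at once by an amount equal to the time constant and adds one unit of velocity, so a smaller time constant corresponds to relatively more rate aid.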

The objectives of this study are both of a 
theoretical and applied nature. What is the 
actual psychological effect of a tracking aid? 
Is the aid an actual automation of rate-con- 
trol movements, as it is supposed to be? 
Are there other effects of the aid on the track- 
ing behavior? These questions are basic to 
many theoretical questions beyond the ap- 
plied problem of the tracking aid. Partial 
answers to them are supplied by the present 
experiment. 


Method 
Apparatus 


The aided-pursuit tracking device used in this 
study has been described elsewhere (4, 8). The op- 
erator’s task is to keep a cursor aligned with a mov- 
ing target by adjusting a handwheel control. The 
pattern of target movement is determined by an 
irregularly shaped cam driven by a 1-r.p.m. motor. 
The target moves in a radial course involving nine 
reversals of direction and continuous changes in 
velocity. Since target velocity is continually chang- 


1 This research has been supported by funds voted 
by the Legislature of the State of Wisconsin and as- 
signed by the Graduate School Research Committee, 
the University of Wisconsin. 

2 Presently on a Fulbright research grant at the 
Psychological Laboratory, University of Cambridge, 
England. 


ing, the target-speed variable is expressed in terms 
of the r.p.m. of a variable-speed motor which drives 
the target through a ball-and-disc integrator. Pat- 
tern of target movement remains constant at the 
various target speeds so over-all speed changes are 
a result of slight increases in the extent of the ten 
back-and-forth sweeps of the target. 

The error-recording system employs a generator 
and receiver selsyn which continually compare the 
position of the cursor with that of the moving 
target. When target and cursor are properly aligned, 
the shaft of the receiver selsyn remains stationary. 
However, when target and cursor are not aligned, 
the shaft of the receiver selsyn moves off the zero 
error line in the direction and to the extent of the 
distance off target. An electrically heated writing 
point attached to the shaft of the receiver selsyn 
makes a continual tracing of tracking error on 
waxed kymograph paper. 


Experimental Design and Procedure 


The data consist of tracking records for 27 Ss. 
All Ss had been trained for four days in connection 
with another study (8). The Ss reported for a fifth 
day during which the present records were taken. 

The experimental design takes the form of a 
3 × 3 factorial in the cells of a replicated 9 × 9 
latin square. Two variables are manipulated 
simultaneously. They are target speed and aided-tracking 
time constant. The three target speeds used are 
23 r.p.m., 30 r.p.m., and 37 r.p.m. The time 
constants are 0.25 sec., 0.5 sec., and 1.0 sec. The nine 
combinations of target speed and time constant make 
up the experimental conditions. Each S performs on 
each experimental condition in an order determined 
by one of the nine different sequences of conditions 
occurring in the latin square. 
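
The design can be laid out in a few lines. The sketch below builds the nine conditions and assigns the 27 Ss to the nine sequences of a cyclic 9 × 9 latin square; the cyclic construction and the rule for assigning Ss to rows are assumptions made for illustration, since the particular square used is not specified.

```python
from itertools import product

TARGET_SPEEDS = [23, 30, 37]             # r.p.m.
TIME_CONSTANTS = [0.25, 0.5, 1.0]        # sec.
CONDITIONS = list(product(TARGET_SPEEDS, TIME_CONSTANTS))   # nine combinations


def cyclic_latin_square(n):
    """n x n square in which each symbol occurs once per row and per column."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


SQUARE = cyclic_latin_square(len(CONDITIONS))


def sequence_for_subject(subject_index):
    """Order of conditions for one S (ASSUMPTION: rows reused cyclically)."""
    row = SQUARE[subject_index % len(SQUARE)]
    return [CONDITIONS[i] for i in row]


sequences = [sequence_for_subject(s) for s in range(27)]   # three replications
```

Each condition then occurs once in every ordinal position within a replication, which balances practice and fatigue effects across conditions.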

Error records are analyzed³ in a manner illustrated 
by Fig. 1 (6). Three categories of errors are 
distinguished in terms of their duration or extent 
along the time axis: short wavelength errors, errors 
of intermediate wavelength, and long wavelength 
errors. When these error categories are defined in 
terms of duration they are, approximately: short 
wavelength errors, less than one second; intermediate 
wavelength errors, between 1 per second and 1 
per 3.5 seconds (i.e., durations of 1 to 3.5 seconds); 
long wavelength errors, 3.5 seconds or more. 
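
The three duration categories can be expressed as a simple rule. The sketch below treats the approximate boundaries of 1 and 3.5 seconds as exact cut points, which is an assumption; the function name and the sample durations are hypothetical.

```python
from collections import Counter


def classify_error(duration_sec):
    """Assign an error excursion to a wavelength category by its duration."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    if duration_sec < 1.0:
        return "short"            # less than one second
    if duration_sec < 3.5:
        return "intermediate"     # from 1 up to 3.5 seconds
    return "long"                 # 3.5 seconds or more


# Hypothetical durations (sec.) read from one tracking record.
durations = [0.4, 0.8, 1.2, 2.0, 3.5, 5.0]
counts = Counter(classify_error(d) for d in durations)
```

Counting the excursions in each category for a one-minute trial gives the kind of frequency score that is analyzed in the Results.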




3 The authors gratefully acknowledge the painstaking 
efforts of Miss Anne Mathews in reading the 
error records and the assistance of Betty Pearl Simon 
in collecting the data. 


Fig. 1. Enlarged drawing of part of a tracking 
error record showing the method used to categorize 
errors (amplitude of errors plotted against time). 
Excursions from zero error are measured 
from the point at which the excursion begins to the 
point at which it ends. The wavelengths of superimposed 
errors are measured parallel to the zero 
error line. The starting or end point of a superimposed 
error defines the distance from the zero 
error line at which this measure is taken. The letters 
R, I, P, on the record mean, respectively, rate 
control, intermediate, and positioning errors. 


Results 


Figure 2 pictures the mean number of 
errors in the three categories as a function of 
aided-tracking time constant. Each value 
represents the average number of errors of a 
given wavelength made during a one-minute 
trial. Data from the three target speeds are 
combined. It can be noted that the number 
of long wavelength rate errors and short 
wavelength positioning errors both increase 
significantly as time constant increases, i.e., 
as the amount of aiding or rate control de- 
creases. However, the number of intermedi- 
ate wavelength errors shows a significant de- 
crease as the aiding decreases. 

Figure 3 pictures the mean number of 
errors in the three categories as a function 
of target speed. Data from the three time 
constants are combined. Only the number of 
short wavelength errors shows a significant 
increase with increasing target speed. There 
is a tendency for both long wavelength errors
and intermediate errors to increase with target
speed, but the over-all Fs are not statistically
significant.

[Figure 2: mean number of errors (vertical axis) plotted against aided-tracking time constant (horizontal axis), with separate curves for intermediate, short, and long wavelength errors.]

Fig. 2. Mean number of errors in the three categories
as a function of aided-tracking time constant.

Tables 1, 2, and 3 summarize the analyses
of variance of the frequency of errors in the
three categories. Bartlett chi-square tests
indicated heterogeneity of variance (2) so a
√(x + .5) transformation was used.

Table 1 is a summary of the analysis of
the short wavelength (duration) errors. Both
target speed and aided-tracking time constant
are significant sources of variation when
tested against residual error. The significant
F for trials reflects the tendency for the number
of short wavelength errors to decrease
over the trials within a single experimental
session.

Tables 2 and 3 summarize the analyses of
the long wavelength errors and the intermediate
category errors. In both analyses, the
square-uniqueness mean square is significantly
larger than the residual-error mean
square and is therefore used as the error term
to test target speed and time constant. In
both cases, only time constant proves to be a
significant source of variation.


[Figure 3: mean number of errors (vertical axis) plotted against target speed (horizontal axis), with separate curves for intermediate, short, and long wavelength errors.]

Fig. 3. Mean number of errors in the three categories
as a function of target speed.


Table 1

Summary of Analysis of Variance
Short Wavelength Errors

Source                          Mean Square
Target speed                       4.9901
Time constant                     13.2004
Speed × constant interaction        .4058
Trials                             1.0295
Sequences                          5.7482
Residual between Ss                3.4736
Square uniqueness                   .3236
Residual error                      .2708

* Significant at the .01 level of confidence.


Table 2

Summary of Analysis of Variance
Long Wavelength Errors

Source                          Mean Square        F
Target speed                        .5101        2.90
Time constant                     12.5589       71.52*
Speed × constant interaction        .3799        2.16
Trials                              .1955        1.11
Sequences                          1.7117        1.54
Residual between Ss                1.1105        6.32*
Square uniqueness                   .1756        1.52*
Residual error                      .1159

* Significant at the .01 level of confidence.


Table 3

Summary of Analysis of Variance
Intermediate Wavelength Errors

Source                          df    Mean Square        F
Target speed                     2        .7768        3.02
Time constant                    2       5.3984       20.96*
Speed × constant interaction              .3752        1.46
Trials                                    .1800        1.43
Sequences                                1.5110        1.19
Residual between Ss                      1.2692        4.93*
Square uniqueness                    [illegible]       1.88*
Residual error                            .1368

* Significant at the .01 level of confidence.
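
The choice of error term can be checked against the Table 2 figures for long wavelength errors. The sketch below is a check of arithmetic only: it divides each mean square by the square-uniqueness mean square and compares the result with the printed F. The leading decimal points on the mean squares are restored here as an assumption (the scan prints, for example, "5101"), and the dictionary keys are hypothetical labels.

```python
# Mean squares from Table 2 (long wavelength errors); decimal points restored
# on the assumption that the scan dropped them.
MEAN_SQUARES = {
    "target_speed": 0.5101,
    "time_constant": 12.5589,
    "speed_x_constant": 0.3799,
    "residual_between_ss": 1.1105,
    "square_uniqueness": 0.1756,
}


def f_ratio(effect_ms, error_ms):
    """Return the F ratio of an effect mean square to its error mean square."""
    return effect_ms / error_ms


ERROR_MS = MEAN_SQUARES["square_uniqueness"]

# F ratios for the effects tested against square uniqueness.
F_RATIOS = {
    name: f_ratio(ms, ERROR_MS)
    for name, ms in MEAN_SQUARES.items()
    if name != "square_uniqueness"
}
```

Under these assumptions the ratios come to about 2.90 for target speed and 71.52 for time constant, matching the F values printed in the table.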


Discussion and Summary 


Records of error from 27 Ss are analyzed 
to find the relation between types of error in 
pursuit tracking and two main determinants 
of tracking accuracy, target speed, and aided- 
tracking time constant. Three categories of 
error are distinguished in the analysis. The 
short wavelength errors are thought to rep- 
resent positioning errors or quick adjustive 
movements to get back on course. The long 
wavelength errors probably represent errors 
in rate adjustment. 

The main finding of this study is that the 
psychological effects of an aiding device are 
complex. Different types of movement which 
produce error are differentially affected by 
the aid. Increasing the aiding decreases the 
frequency of short wavelength (fine position- 
ing) and long wavelength (rate control) er- 
rors. However, errors of intermediate wave- 
length are increased in frequency when aiding 
is increased (i.e., when the aided-tracking
time constant is decreased). 

Increasing target speed generally increases 
the frequency of all types of error but this 







increase is statistically significant only for
errors of short wavelength.

The fact that the long wavelength errors 
decrease in number as aiding is increased sug- 
gests that the aid, within narrow limits, is an 
effective automation of the rate control move- 
ments. Increasing the aiding also reduces the 
number of short wavelength errors. This 
finding is in keeping with our previous (8) 
claim that a main effect of the aid is to filter 
out a certain percentage of the fine position- 
ing errors. 

Intermediate wavelength errors, which ac- 
count for most of the errors in the task, are 
increased in number with increased aiding. 
This finding both confirms and explains prior 
data (3, 5) which indicate that the aided- 
tracking device is not an aid at all but a 
hampering device. The instrumental addition 
in aided tracking produces transformations of 
movements which increase both the percep- 
tual and reactive complexity of the task be- 
yond that required in unaided tracking. Only 
in the simplest tracking tasks, involving very 
slow target speeds and uniform target courses, 
will performance be improved by the aid. 
The aid is more than an automation of rate. 
It is a filter for rapid movements and a de- 
terrent to good position control. 

The “optimum” aided-tracking time con- 
stant has been shown to vary within the gen- 
eral range of .25 to 1.0 sec. (1). Present re- 
sults indicate that the optimum constant is 
an outcome of many complex negative and 
positive effects of the aid on the component 
movements in tracking. We cannot agree 
with the interpretation of Mechler, Russel, 
and Preston (7) that an optimum time con- 
stant of .5 sec. can be interpreted as a spe- 
cific reaction-time property of discrete move- 
ments in tracking. Weaknesses in this theory 
have been pointed out previously (8). The 
present findings put an even greater burden 
on the theory for it cannot be extended to 
make provision for the compound positive 




and negative effects of an aid on the different 
types of error-producing movements. 

The experimental facts presented here give 
further support to what we have called a 
resonance theory of tracking (8). Tracking 
movements are a spectrum of continuous 
oscillatory movements related to positioning 
control and rate control. The oscillatory fea- 
tures of these movements are defined funda- 
mentally by the orbits of hand, arm, and 
body motion in tracking. The dynamic cor- 


respondence between these orbits of move- 
ment and the mechanical properties of the 
tracking device, in relation to target course 
and target speed, define the nature of track- 
ing error. 


Received February 1°, 1956. 


References 


1. Andreas, B. G., & Weiss, B. W. Review of
research on perceptual motor performance
under varied display-control relationships.
Rochester, N. Y.: University of Rochester, 1954.
(Sci. Rep. No. 2, Contract AF 30(602)-200.)

2. Edwards, A. L. Homogeneity of variance and
the latin square design. Psychol. Bull., 1950,
47, 118-129.

3. Lincoln, R. S. Visual tracking: III. The
instrumental dimension of motion in relation to
tracking accuracy. J. appl. Psychol., 1953,
37, 489-493.

4. Lincoln, R. S., & Smith, K. U. Transfer of
training in tracking performance at different target
speeds. J. appl. Psychol., 1951, 35, 358-362.

5. Lincoln, R. S., & Smith, K. U. Systematic
analysis of factors determining accuracy in visual
tracking. Science, 1952, 116, 183-187.

6. Lincoln, R. S., Simon, J. R., & De Crow, T. W.
The effects of practice upon different
component movements in visual tracking.
Percept. & Mot. Skills Res. Exch., 1952, 4, 123-131.

7. Mechler, E. A., Russel, J. B., & Preston, M. G.
The basis of the optimum aided-tracking time
constant. J. Franklin Inst., 1949, 248, 327-334.

8. Pearl, Betty E., Simon, J. R., & Smith, K. U.
Visual tracking: IV. Interrelations of target
speed and aided-tracking ratio in defining
tracking accuracy. J. appl. Psychol., 1955,
39, 209-214.








Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Ability Grouping in Army Basic Combat Training ¹


Donald C. Findlay, Seymour M. Matyas, and Hermann Rogge III 
Human Research Unit Nr 1, CONARC, Fort Knox, Kentucky 


Whether or not to group students by intel- 
lectual level has been a persistent problem 
for educators and training supervisors since 
the introduction of the group intelligence test. 
Of the several objections raised against the 
practice of ability grouping, the most fre- 
quent criticism has been that segregating slow 
learners deprives them of the help and stimu- 
lation provided by rapid learners. 

This study investigated the benefits of het- 
erogeneous ability grouping in Army Basic 
Combat Training, an eight-week program 
stressing fundamental military skills. The 
hypothesis tested was: Low-ability men, when 
grouped with higher ability men in training, 
will reach a significantly higher level of pro- 
ficiency than low-ability trainees grouped by 
themselves. 


Method 


In this study, low-ability men were trained in 
squads with the usual heterogeneous (low-medium- 
high) spread of intelligence scores, as well as in spe- 
cial squads (low-high) from which medium ability 
men were excluded. A third type of squad, contain- 
ing low-ability men only (low) served as a control. 
To facilitate “interaction” learning within each type 
of squad, a system of competition was established 
in which rewards were given on the basis of squad 
proficiency rather than individual proficiency. 

If the hypothesis was valid the low-ability men in 
squads with high-ability men (low-high) would at- 
tain the highest level of achievement, low men in 
heterogeneous squads (low-medium-high) would be 
second, and low men in squads by themselves (low) 
would be third.

Subjects. Two experimental companies, each con- 
taining approximately 200 men, were specially or- 
ganized to meet the intelligence requirements of the 
study. These men had been in the Army only a 
few days before their assignment to an experimental 
company. They were selected on the basis of
intelligence only, using the Aptitude Area I (AAI)
score of the Army Classification Battery (1). The
Aptitude Area I scale, which has a mean of 100 and
a standard deviation of 20, is based on the average
score of the trainee’s performance on three tests:
Reading and Vocabulary, Arithmetic Reasoning, and
Pattern Analysis.


1 The research reported here was conducted by the
senior author while he was employed by The George
Washington University, Human Resources Research
Office, operating under contract with the
Department of the Army. The junior authors were Army
enlisted men assigned to Human Research Unit Nr 1.
Opinions and conclusions are those of the authors
and should not be construed as representing those of
the Department of the Army.

Criterion. A four-hour performance test, adminis- 
tered at the end of the training period, was the ex- 
perimental criterion. This test, for which norms 
(based on other companies) are available, is a com- 
prehensive and reliable instrument, specifically con- 
structed to measure proficiency in Basic Combat 
skills (2). 

Procedure. The experiment was conducted inde- 
pendently, but in the same manner, within each of 
the two companies. The officers, cadre, and physi- 
cal facilities of the companies were not predeter- 
mined; they were company organizations which hap- 
pened to be available for filling at the time of the 
experiment. In each company all subjects attended 
the same classes, received the same instruction, and 
in general experienced similar treatment from the 
Army. 

Three levels of intelligence or ability were defined: 
(a) low ability, AAI score of 90 or lower; (b) me- 
dium ability, AAI score of 91 through 110; and (c) 
high ability, AAI score of 111 or higher. Frequen- 
cies within each of the categories were proportional 
to the distribution of the AAI population. In each 
company, these ability levels were used to form the 
three types of squads (low-medium-high, low-high, 
low), with subjects from a given level being assigned 
randomly to appropriate squad type. 

The low-medium-high squads were formed to give 
an intelligence composition approximating that usu- 
ally found in training companies, and contained 25% 
low-ability men, 50% medium-ability men, and 25% 
high-ability men. Within each of these squads, low- 
ability men could associate with medium- and high- 
ability men. The low-high squads, constructed to 
provide maximum opportunity for association be- 
tween low-ability men and high-ability men, con- 
tained an equal number of each. The only kinds of 
association possible were (a) association of men 
whose abilities were approximately equal, and (b) 
association of high-ability men and low-ability men. 

The low squads, by containing only low-ability 
men, permitted low-ability men to associate, within 
the squad, with other low-ability men only. 
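
The ability levels and squad mixes described above can be summarized in a brief sketch. It is illustrative only: the function and variable names are hypothetical, and the squad size passed to squad_counts is an assumption, since the source does not state how many men a squad contained.

```python
def ability_level(aai_score):
    """Classify a trainee by Aptitude Area I score using the cutoffs in the text."""
    if aai_score <= 90:
        return "low"
    if aai_score <= 110:
        return "medium"
    return "high"


# Share of each ability level in the three squad types.
SQUAD_MIX = {
    "low-medium-high": {"low": 0.25, "medium": 0.50, "high": 0.25},
    "low-high": {"low": 0.50, "high": 0.50},
    "low": {"low": 1.00},
}


def squad_counts(squad_type, squad_size):
    """Return the number of men of each level for a squad of the given (assumed) size."""
    counts = {}
    for level, share in SQUAD_MIX[squad_type].items():
        n = share * squad_size
        if n != int(n):
            raise ValueError("squad size does not divide evenly for this mix")
        counts[level] = int(n)
    return counts
```

With a hypothetical squad of 12 men, a low-medium-high squad would hold 3 low, 6 medium, and 3 high men, and a low-high squad 6 of each.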

The experimental companies were organized in a 
manner intended to restrict the association and com- 
munication of the low-ability trainee to the group of 
men who composed his squad or squads like his. 
Each of the four platoons in a company was
composed of four squads of the same type; thus a
company contained two platoons of low-medium-high
squads, one platoon of low-high squads, and one
platoon of low squads. Each platoon was housed
separately, a specific barracks area being assigned
to each squad.

To obtain a situation especially conducive to inter- 
action learning in squads which contained low- and 
higher-ability men, a weekly competition was held 
in the form of a proficiency test over the material 
covered in the week’s instruction. Although testing 
was individual, competition was based on the aver- 
age score of the entire squad. 

The competition took place within the platoon 
only so that squads always competed against squads 
of identical ability composition. Each platoon had 
a winning and a losing squad each week, with the 
winning squad receiving week-end passes, exemption 
from extraduty work details, and priority in the 
mess line. The losing squad in each platoon received 
no passes, ate last, and performed most of the extra- 
duty work details during the following week. 
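
The weekly competition rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, squad labels, and scores are hypothetical, and ties are broken arbitrarily because the source does not say how they were handled.

```python
from statistics import mean


def rank_squads(platoon_scores):
    """Rank the squads of one platoon by the average of their members' weekly scores.

    platoon_scores maps a squad label to the list of individual test scores.
    Returns (winning squad, losing squad, squad averages).
    """
    averages = {squad: mean(scores) for squad, scores in platoon_scores.items()}
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked[0], ranked[-1], averages


# Hypothetical scores for the four squads of one platoon.
EXAMPLE_PLATOON = {
    "squad_1": [70, 80, 90],
    "squad_2": [60, 65, 70],
    "squad_3": [85, 85, 85],
    "squad_4": [50, 90, 70],
}
WINNER, LOSER, AVERAGES = rank_squads(EXAMPLE_PLATOON)
```

Because only the squad average counts, a single low individual score lowers the standing of the whole squad, which is the incentive for mutual help that the design relied on.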

To inform each squad of its rank in the platoon 
and to identify the men within the squad whose 
scores had raised or lowered the group’s score, indi- 
vidual scores as well as squad scores were posted 
each week. Thus as the training program progressed, 
higher ability men could learn which of their squad- 
mates needed assistance, and similarly, low-ability 
men could learn which of their squadmates could 
give them assistance. 

Criterion testing with the four-hour proficiency 
test was conducted in the two companies on the last 
day of the Basic Training program. 


Results 


The performance of low-ability men on the 
criterion test was analyzed by kind of group- 
ing (squad type) and by company. The 
analysis of variance is summarized in Table 
1. Neither grouping nor company difference 
was significant. The interaction was also 
nonsignificant. 
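
The size of these effects can be seen by forming F ratios from the Table 1 mean squares. The sketch assumes that each effect is tested against the error mean square (147.47), which the text does not state explicitly; the dictionary keys are hypothetical labels.

```python
# Mean squares from Table 1 (low-ability trainees, final proficiency test).
ERROR_MS = 147.47
EFFECT_MS = {
    "groups": 21.69,
    "companies": 122.32,
    "groups_x_companies": 207.26,
}

# Each effect mean square divided by the error mean square.
F_RATIOS = {name: ms / ERROR_MS for name, ms in EFFECT_MS.items()}
```

Under that assumption the ratios for groups and companies fall below 1 and the ratio for the interaction is about 1.4, in keeping with the report that none of the effects was significant.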

Differences between the three levels of
ability (low, medium, and high) were found
to be significant beyond the .001 level in an
analysis of variance summarized in Table 2.
The company difference and the interaction
were not significant.


Table 1

Summarized Analysis of Variance of Performance
on Final Proficiency Test by Low-Ability
Trainees in Three Ability Groupings

Sources of Variance        df    Mean Square
Groups                      2        21.69
Companies                   1       122.32
Groups × companies          2       207.26
Error                     142       147.47


Table 2

Summarized Results of Analysis of Variance of
Performance on Final Proficiency Test
by Three Levels of Ability

Sources of Variance             df    Mean Square
Ability levels                   2     14,429.87
Companies                        1        703.56
Ability levels × companies
Error
                                          90.93

A mean score for the experimental com- 
panies on the criterion test was computed by 
averaging the mean scores of subjects within 
intervals of ten points on the AAI scale, i.e., 
the mean score of subjects with AAI scores 
between 71 and 80, between 81 and 90, etc. 
When this mean experimental company score 
was compared with similarly derived mean 
scores of norm companies (norms for
pretraining companies as well as posttraining
companies being available), it was found 
that experimental companies had apparently 
learned about 28% more than the average 
company. Analysis of variance revealed that 
differences between experimental and norm 
companies were significant beyond the .005 
level. 
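
The averaging procedure described above, a mean of the interval means rather than a mean of all individual scores, can be sketched as follows. The function names and the example records are hypothetical; the ten-point intervals (71 to 80, 81 to 90, and so on) follow the text.

```python
from collections import defaultdict
from statistics import mean


def interval_start(aai_score, width=10):
    """Return the lower bound of the AAI interval, e.g. 71 for scores 71 through 80."""
    return ((aai_score - 1) // width) * width + 1


def mean_of_interval_means(records, width=10):
    """Average the criterion means of the AAI intervals.

    records is an iterable of (aai_score, criterion_score) pairs.
    """
    by_interval = defaultdict(list)
    for aai_score, criterion_score in records:
        by_interval[interval_start(aai_score, width)].append(criterion_score)
    return mean(mean(scores) for scores in by_interval.values())


# Hypothetical records: two men in the 71-80 interval, one in the 81-90 interval.
EXAMPLE = [(75, 60), (78, 70), (85, 80)]
COMPANY_MEAN = mean_of_interval_means(EXAMPLE)
```

Weighting every interval equally keeps a company with an unusual distribution of AAI scores comparable with the norm companies; in the example the result is 72.5, whereas the plain mean of the three scores would be 70.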


Discussion 


The results of this study fail to support the 
view that low-ability trainees profit by asso- 
ciation in training with higher ability trainees. 
Even though special motivational and organi- 
zational conditions were introduced to facili- 
tate the hypothesized interaction learning, 
low-ability men learned no more in squads 
with high-ability men or in squads with high- 
and medium-ability men than in squads with 
other low-ability men only. 

That these results did not come about 
through indifference of subjects to the pro- 
gram of squad competition seems to be shown 
by the unusually high performance of trainees 
of all ability levels on the final proficiency 










test. Likewise, the superiority of both me- 
dium- and high-ability men on the final pro- 
ficiency test indicates that higher ability sub- 
jects actually knew enough to be of help to 
their low-ability squadmates. Although men 
of all levels apparently wanted to be in win- 
ning squads and although the higher ability 
men apparently could have helped lower abil- 
ity men in their squads, it seems that (a) the 
higher ability men failed to give help, or (b)
the help given was not sufficient to produce 
differential proficiency between low-ability 
men in the various kinds of squads. 

Since heterogeneous grouping failed to in- 
crease achievement in a situation which was 
deliberately made conducive to interaction 
learning, it is unlikely that such benefits will 
occur in similar training situations in which 
there are no special conditions. 


Summary 


This study investigated the effectiveness of 
heterogeneous ability grouping as a method 
of increasing proficiency in Army Basic Com- 
bat Training. In each of two companies, low- 
ability trainees were trained under three con- 
ditions of ability grouping. One group of 




low-ability men trained in squads containing 
only low-ability men (low), one group in 
squads containing high- and medium-ability 
men also (low-medium-high), and one group 
in squads containing high men also (low- 
high). In spite of a system of competition 
that made privileges dependent on squad per- 
formance, a proficiency test given at the end 
of eight weeks of training failed to show a 
significant difference between the learning of 
low-ability men who had high-aptitude men 
in their squads and those who did not. 
Achievement at all ability levels was unusu- 
ally high, but low men who were trained in 
squads by themselves were just as proficient 
as low men who were trained in squads with 
higher ability men. 


Received February 13, 1956. 


References 


1. A manual for the army classification battery.
Department of the Army, SR 615-25-27, Feb. 21,
1951.

2. Baker, R. A., et al. Manual for the
administration of the individual proficiency tests for
basic combat and advanced light infantry
training. Human Resources Research Office,
George Washington Univer., 1955 (TR 19).





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Differentiation of Individuals in Terms of Their Predictability 


Edwin E. Ghiselli 


University of California 


When scores on a test are unrelated to cri- 
terion scores or are related to them only to a 
very low degree, the presumption is that the 
test is of little value. Hence in a prediction 
or selection situation tests with low validity 
are quickly discarded and the entire effort is 
directed to the development of tests which 
will yield scores that are substantially related 
to the criterion. 

Even though the validity coefficient of a 
test is negligible, there is the possibility that 
at least with certain individuals reasonably 
accurate predictions of criterion performance 
nevertheless may be made from scores on the 
test. As one regards the scatter diagram of 
the scores on two variables that exhibit a low 
relationship, it is apparent that some indi- 
viduals fall on or very close to the line of re- 
lations while others depart markedly from it. 
Thus for some individuals there is quite close 
correspondence between standard scores on 
the test and standard scores on the criterion. 
The remainder of the individuals display to 
varying degrees differences between standard 
test and standard criterion scores. 

Suppose, as the author has suggested else- 
where (5), that it were possible by some 
other means, perhaps another test, to differ- 
entiate those individuals whose test and cri- 
terion scores show small discrepancies from 
those individuals whose test and criterion 
scores are markedly different. Then it would 
be possible to screen out a group for which at 
least reasonably accurate predictions can be 
made. Thus even though the validity of the 
test for the entire group is low, for some in- 
dividuals who can be differentiated before- 
hand, the test would have some practical 
utility. 

In a somewhat different form this notion is 
implicit in dealing with individual cases in 
clinical and guidance work. Consider the 
case of a counselor attempting to decide 
whether a young person should seek educa- 
tion above the secondary school level. If it 


appears that motivation or interests seem in- 
appropriate, he might not recommend college 
even though the intelligence test score is high. 
In effect, what is being said is that when the 
individual possesses certain other character- 
istics, there will be little correspondence be- 
tween test performance and college achieve- 
ment. 

Therefore there is nothing new in the no- 
tion that it is possible to differentiate between 
those individuals for whom a test is a good 
predictor and those for whom it is a poor one. 
However, it remains to be seen whether it is 
possible to make such a differentiation in a 
systematic and objective fashion. It is the 
purpose of the present investigation to ex- 
amine this possibility. 


Methods and Procedures 


Scores from one test and two inventories were ob- 
tained on candidates for the job of taxicab driver at 
the time of hiring. The test consisted of tapping 
and dotting items, and the inventories consisted of 
24 pairs of forced-choice items which sought to get 
at appropriateness of occupational level and interest 
in jobs involving personal relationships. The details 
of these devices have been described elsewhere (1). 
Previous investigations have indicated that these de- 
vices have some, though modest, validity for vari- 
ous aspects of the job of taxicab drivers (2, 3, 4). 

In the present investigation the criterion of job 
proficiency consisted of production during the first 
12 weeks of employment. Raw production figures 
were corrected for temporal variation and differ- 
ences in division in which the driver operated. Rec- 
ords were obtained on 193 men who were randomly 
divided into two groups, 100 comprising an experi- 
mental group, and 93 a cross-validation group. 


Results 


The validity coefficients of the three pre- 
dictors together with their intercorrelations 
for the experimental group are given in 
Table 1. The validity of the tapping and 
dotting test at best can be characterized as 
limited. Neither of the two inventories has 
any appreciable value as a selective device. 
It is apparent that any combination of scores
on the test and either of the inventories, as
through multiple correlation, would have no
greater validity than that of the test alone.


Table 1

Validity Coefficients of and Intercorrelations
Among Predictor Variables for the
Experimental Group

                         Tapping      Occupational     Personal
                           and           Level       Relationships
Variables                Dotting       Inventory       Inventory
Criterion                  .259           .055            .125
Difference Score                          .318            .126
Tapping and Dotting                       .029            .283

For each individual in the experimental 
group the difference between his standard 
score on the tapping and dotting test and his 
standard criterion score was computed. Dif- 
ferences in sign were ignored; hence an indi- 
vidual with a low difference score was one 
whose standard test and criterion scores were 
very similar, and an individual with a large 
difference score was one whose standard test 
and criterion scores were very different. The 
coefficients of correlation between these dif- 
ference scores and scores on the two inven- 
tories are given in Table 1. The coefficient 
of correlation was found to be of moderate 
size for the occupational level scale and low 
for the personal relationships scale. There- 
fore there was a tendency for those individu- 
als who made a low score on the occupational 
level inventory to display a correspondence 
between standard test and criterion scores, 
and for those individuals who made a high 
score to show a discrepancy between test and 
criterion scores. There was little such tend- 
ency in the case of the personal relationship 
inventory. 
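
The difference score described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's computation: the function names are hypothetical, and standard scores are formed with the population standard deviation, which the source does not specify.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Convert raw scores to standard (z) scores using the population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def difference_scores(test, criterion):
    """Absolute difference between standard test and standard criterion scores."""
    z_test, z_crit = standard_scores(test), standard_scores(criterion)
    return [abs(t - c) for t, c in zip(z_test, z_crit)]


def pearson_r(x, y):
    """Product-moment correlation as the mean product of standard scores."""
    return mean(a * b for a, b in zip(standard_scores(x), standard_scores(y)))
```

Correlating difference_scores(test, criterion) with an inventory score by means of pearson_r gives the kind of coefficient reported in the Difference Score row of Table 1; a positive value means that low inventory scores go with close agreement between test and criterion.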

From the foregoing it would appear that if 
only those individuals who made low scores 
on the occupational level inventory were used, 
the coefficient of correlation between scores 
on the tapping and dotting test and the cri- 
terion would be greater than the value of .259 
obtained for the entire group. However, no 
such tendency should result from a similar 
selection on the basis of the personal rela- 
tionship inventory. 



To examine this notion, the validity coeffi- 
cients for the tapping and dotting test were 
calculated for the cross-validation group using 
three degrees of selectivity on the basis of 
the two inventories. The validity of the test 
scores was calculated for the one-third and
two-thirds earning the lowest scores on the 
two inventories. The first of these groups 
should be composed of the one-third of the 
individuals whose performance is quite pre- 
dictable with the least predictable two-thirds 
discarded. The second group should be com- 
posed of the individuals whose performance is 
fairly well predictable with the least
predictable one-third discarded.

For the one-third of the individuals in the
cross-validation group whose scores on the
occupational level inventory indicated their job
performance should be quite predictable from
the tapping and dotting test the validity co- 
efficient was found to be .664, whereas the 
validity coefficient for the most predictable 
two-thirds of the individuals was .323, and 
that for all cases only .220. On the other 
hand, for the one-third of the individuals 
whose scores on the personal relationships 
inventory indicated their job performance 
should be most predictable, the validity of 
the test was .000, for the most predictable 
two-thirds it was only .130, and for all cases 
it was .100. 
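
The subgroup validities reported above can be computed along the following lines. The sketch is hedged: the function names are hypothetical, the handling of ties and of rounding when forming the one-third and two-thirds groups is an assumption, and the data in the usage note are made up for illustration.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Product-moment correlation (population form); fails if either SD is zero."""
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def validity_in_lowest_fraction(test, criterion, screen, fraction):
    """Validity of the test among the cases scoring lowest on the screening inventory.

    fraction is the share of cases retained, e.g. 1/3, 2/3, or 1.0 for all cases.
    """
    order = sorted(range(len(screen)), key=lambda i: screen[i])
    keep = order[: max(2, round(len(order) * fraction))]
    return pearson_r([test[i] for i in keep], [criterion[i] for i in keep])
```

With the made-up values test = [1, 2, 3, 1, 2, 3], criterion = [1, 2, 3, 3, 2, 1], and screen = [0, 0, 0, 1, 1, 1], the validity is 1.0 in the lowest-scoring half and 0.0 for all cases, which mirrors in miniature the pattern found with the occupational level inventory.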

In a practical selection situation, such as 
the present one with taxicab drivers, a first 
elimination of applicants can be made by 
dropping out those individuals for whom pre- 
diction of job success by means of the selec- 
tion test is likely to be poor. Then a second 
elimination can be made on the basis of the 
selection test, picking those individuals whose 
scores are high. Thus in the present case 
those candidates scoring high on the occupa- 
tional level inventory could be first elimi- 
nated. This process would leave those whose 
performance is substantially related to scores 
on the tapping and dotting test. Then those 
scoring low in this test could be eliminated 
resulting in the retention of a group whose 
average criterion performance is high. If the 
personal relationship inventory were used, no 
such benefits should accrue. 

The question then is raised as to what
proportion of candidates should be dropped out
by the first screening and what proportion by 
the second screening. For example, if it is 
desired to obtain from a group of individuals 
20% whose criterion performance will be sig- 
nificantly better than average, should 40% 
be dropped in the first screening and 40% in 
the second screening, or should 20% be 
dropped in the first screening and 60% in 
the second screening? No definitive answer to 
this question can be offered at the present 
time. Undoubtedly the optimal percentages 
to be eliminated in the two screenings will be 
a function of the magnitude of the correla- 
tions between the tests, the criterion, and the 
difference scores. 

On purely rational grounds it would appear 
that the optimal percentages eliminated in the 
two screenings would be nearly the same. If 
a very high proportion is eliminated in the 
first screening, while to be sure the predic- 
tion of success of the remainder will be good, 
there will be so few individuals left to elimi- 
nate in the second screening that there will 
be very little improvement in criterion scores. 
On the other hand, if very few are eliminated 
in the initial screening, then the validity of 
the selection test for the second screening will 
be so low that even with a high proportion 
eliminated the gain will be small. 

To illustrate the problem an example using 
the cross-validation group is presented in 
Fig. 1. The objective of the selection proc- 
ess is taken as the selection of the best 20% 
of candidates. Then various distributions of 
elimination between the two screenings can 
be made of the remaining 80%. At one ex- 
treme none can be eliminated in the first 
screening and the entire 80% can be elimi- 
nated on the basis of the second screening. 
At the other extreme 80% could be elimi- 
nated on the basis of the first screening and 
none in the second screening. The mean of 
the standard criterion scores of the “best” 
20% of individuals selected by various dis- 
tributions of percentages of elimination at 
the two stages were calculated. The mean 
criterion scores of the individuals remaining 
after elimination are shown in Fig. 1. 
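
The two-stage procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than the author's computation: the function name and the example data are hypothetical, and rounding of group sizes is an assumption. The first screening drops the highest scorers on the screening inventory, and the second keeps the highest scorers on the selection test until the final 20% remain.

```python
from statistics import mean


def two_stage_mean_criterion(screen, test, criterion, first_cut, final_keep=0.20):
    """Mean criterion score of the cases retained after two screenings.

    first_cut is the fraction eliminated in the first screening (0 to 0.80);
    the second screening eliminates the rest down to final_keep of the group.
    """
    n = len(test)
    n_final = round(n * final_keep)
    n_after_first = n - round(n * first_cut)
    if n_after_first < n_final:
        raise ValueError("first screening removes too many cases")
    # First screening: retain the lowest scorers on the screening inventory.
    survivors = sorted(range(n), key=lambda i: screen[i])[:n_after_first]
    # Second screening: retain the highest scorers on the selection test.
    selected = sorted(survivors, key=lambda i: test[i], reverse=True)[:n_final]
    return mean(criterion[i] for i in selected)


# Hypothetical data for ten applicants.
SCREEN = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
TEST = [5, 9, 1, 8, 2, 7, 3, 6, 4, 10]
CRITERION = [50, 90, 10, 20, 20, 30, 30, 40, 40, 0]
RESULTS = {
    cut: two_stage_mean_criterion(SCREEN, TEST, CRITERION, cut)
    for cut in (0.0, 0.5, 0.8)
}
```

Evaluating the function over first_cut values from 0 to 0.80 in steps of 0.10 would trace one curve of the kind plotted in Fig. 1.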

[Figure 1: mean criterion score of selected cases (vertical axis) plotted against the percent eliminated in the first screening, 0 to 80, and in the second screening, 80 to 0 (horizontal axis), with separate curves for the occupational level inventory and the personal relationship inventory.]

Fig. 1. Mean criterion scores of workers surviving
the selection process under various conditions of
selection.

Reference to Fig. 1 will show that using
the occupational level inventory as the basis
for the first elimination, when a very large
proportion of individuals is eliminated either 
in the first screening or in the second screen- 
ing, the final results are poorest. A more 
equitable division of elimination between first 
screening and second screening is superior. 
Best results were obtained when a somewhat 
larger proportion was eliminated in the first 
than in the second screening. The personal 
relationship inventory, which has little or no 
value in selecting predictable individuals, does 
nothing to improve the selection of high per- 
formers on the criterion. 


Discussion 


The results of this study point to the pos- 
sibility of distinguishing applicants whose job 
performance can be predicted by ordinary 
selective procedures from those whose per- 
formance is poorly predicted. Selective pro- 
cedures, therefore, can be improved not only 
by the addition of highly valid predictors to 
present procedures, but also by the addition 
of devices to screen out individuals whose 
levels of aptitude and job proficiency show 
little correspondence. 

The investigation reported here is not suffi- 
ciently extensive to furnish many clues con- 
cerning the kinds of variables that will be 
useful in this type of screening. It seems 
likely that such variables will have a consid- 
erable degree of specificity for each particu- 
lar selection situation. However, the results
obtained with the occupational level inven-
tory do suggest one interesting possibility. 
Each item in the inventory called for a 
choice to be made between two jobs in terms 
of their interest to the testee. The two jobs 
were similar in nature but one was at a 
lower and the other at a higher level, e.g., 
bookkeeping and accounting. Since the job 
of taxicab driver is only at the semi-skilled 
level, presumably it would not provide suffi- 
cient challenge for a person with higher oc- 
cupational ambitions. Therefore low scores 
on the inventory were taken as the most ap- 
propriate. 

As was seen, scores on the occupational in- 
ventory were unrelated to proficiency, yet 
they did distinguish those individuals whose 
aptitude and achievement levels were similar 
from those whose levels were different. If
the inventory does measure occupational
goals, then it would appear that inclusion
both of individuals whose goals are appro-
priate and individuals whose goals are inap- 
propriate in a validation study masks the 
predictive power of the aptitude measure be- 
ing evaluated. 


Received January 27, 1956. 


References 


1. Brown, C. W., & Ghiselli, E. E. Age of semi-
skilled workers in relation to abilities and in-
terests. Personnel Psychol., 1949, 2, 497-511.

2. Brown, C. W., & Ghiselli, E. E. Prediction of
labor turnover by aptitude tests. J. appl.
Psychol., 1953, 37, 9-12.

3. Brown, C. W., & Ghiselli, E. E. The prediction
of proficiency of taxicab drivers. J. appl.
Psychol., 1953, 37, 437-439.

4. Ghiselli, E. E., & Brown, C. W. The prediction
of accidents in taxicab drivers. J. appl. Psy-
chol., 1949, 33, 540-546.

5. Ghiselli, E. E. Worker selection: concepts and
problems. Personnel Psychol., 1956, 9, 1-16.





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 





Optimum Letter Size for a Given Display Area ¹


C. S. Bridgman and E. A. Wade ²


University of Wisconsin 


A number of studies have investigated the 
visibility of printed materials as a function 
of the design and spacing of the symbols em- 
ployed (see summary in reference 5, Part III, 
Ch. IV, Section II). A related problem can 
be stated as follows. Given a certain display 
space, limited by a high-contrast border, what 
is the maximally visible size of inscribed let- 
ters? To make the problem more concrete, 
we can think of an instrument-panel window 
within which a line of letters is placed. Es- 
thetic and artistic considerations have always 
demanded a margin between inscribed letters 
and the limits of their background. On the 
other hand, printing, particularly lower case 
letters, can still be read when a considerable 
portion of the detail is masked, although this 
has not been tested under threshold condi- 
tions. However, in view of the recognized 
adverse influence of local brightness differ- 
ences on the perception of other nearby con- 
tours (2), better visibility might be found in 
the situation described above for letters small 
enough that their critical contours were some- 
what removed from the high contrast region 
at the edges of the viewing field. Thus better 
visibility might be achieved by using smaller 
letters than the maximum size possible in a 
given space. An experiment has been carried 
out to explore this problem using single lines 
of block capital letters, and a visual acuity 
criterion of visibility. 


Method 
Apparatus 


A variable magnification projector ³ was used to
measure visual acuity. Test material consisted of
the five lines of five letters each on the projection
slide of this instrument. The instrument has an
independent opaque masking slide, with apertures
which govern the relative size of the bright back-
ground. This was modified to provide three ratios
of letter size to vertical dimension of the back-
ground. One of these apertures was made the same
height as the letters, so that the limit of the back-
ground was tangent to the upper and lower edges
of the letters (one-to-one ratio). A second aperture
provided clearance, above and below, one-fifth the
size of the letters (ratio of letter to field size of
1:1.4), and the third provided a clearance of 2.25
times the letter size (1:5.5). It should be under-
stood that the aperture was projected by the vari-
able magnification system, so that the relations stated
above were maintained as letter size was varied to
determine thresholds.

¹ Supported in part by a grant from the Graduate
Research Committee, from funds provided by the
Wisconsin Alumni Research Foundation.

² Now at Tufts University.

³ Clason Acuity Meter, formerly manufactured by
Bausch & Lomb Optical Company.

Projection and observation distance were both 20 
feet. The scale on the instrument was modified to 
read directly in minutes of visual angle subtended at 
S’s eye by the projected image of the letters (i.e.,
five times the unit dimension).
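
As a rough illustration of the scale just described, the following sketch converts between a visual angle in minutes of arc and the physical height of a projected letter at the 20-foot viewing distance. The function names and the 10-minute example angle are assumptions chosen for illustration; they do not come from the instrument.

```python
import math

def letter_height(angle_arcmin, distance):
    """Height (same unit as distance) that subtends angle_arcmin at that distance."""
    angle_rad = math.radians(angle_arcmin / 60.0)
    return 2.0 * distance * math.tan(angle_rad / 2.0)

def visual_angle_arcmin(height, distance):
    """Visual angle, in minutes of arc, subtended by a height at a distance."""
    return math.degrees(2.0 * math.atan(height / (2.0 * distance))) * 60.0

distance_inches = 20 * 12            # 20-foot viewing distance
height = letter_height(10.0, distance_inches)   # illustrative 10-minute letter
stroke = height / 5.0                # letter height is five unit dimensions
print(f"letter height {height:.3f} in., unit dimension {stroke:.3f} in.")
```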

Two background luminance levels were employed, 
8.45 mL. and 0.084 mL. The former was achieved 
by using an aluminized projection screen, the latter 
by projecting onto a flat black mat surface. 


Subjects 


The forty subjects used in this study were men 
and women students from elementary courses in psy- 
chology at the University of Wisconsin. Each had 
at least 20/20 vision as measured on a printed Snel- 
len test chart. Binocular viewing was used through- 
out the experiment. 


Procedure 


To minimize apparent improvements in acuity 
based on increasing familiarity with test material as 
the sessions progressed, Ss were instructed to adopt 
an “ease of reading” criterion. As size was increased, 
the S was asked to state at which size the letters 
first appeared to be just easily readable. As a check, 
he was asked to read the letters. If he missed more 
than one letter, a new determination of acuity was 
made. 

The general experimental procedure was as fol- 
lows: First, one measurement was taken with all five 
lines of letters exposed, at the higher intensity. 
Then, for one group of 20 Ss, the smallest aperture 
was introduced and five thresholds determined, one 
for each line of letters. The procedure was repeated 
for the 1.4 field and then for the 5.5 field. Thresh- 
olds were then determined for the lower intensity, 
with field size presented in the same order. 

For the second group the procedure was the same, 
except the order of presentation of field size was 
reversed. 




Finally, a recheck was made at high intensity with 
all five lines exposed, to determine the extent of 
practice effects. Apparently the threshold technique 
employed was effective in reducing such effects, be- 
cause the mean final thresholds were less than 0.2 
minutes of visual angle lower than the initial thresh- 
old for each group. 


Order of presentation of the five lines of letters 
was counterbalanced among Ss and conditions. 


Results 


Mean thresholds for each subject for each 
condition were determined, and the over-all 
means, for both groups combined, are pre- 
sented in Table 1. As expected, thresholds 
are higher (poorer acuity) when the edge of 
the background field is closer to the letters. 
An analysis of variance of the individual 
means indicated that the relative field size 
variable is statistically significant (p < .001). 
Thresholds improved nearly 11% at both lu-
minance levels when the small surround was
added, and 18 to 20% when the larger sur-
round was provided.

We were also interested, however, accord-
ing to the original question, in determining
the total (vertical) size of the display under 
these threshold conditions in order to see if 
providing a background or surround improves 
visual acuity sufficiently to compensate for 
the extra space taken up, i.e., enough to per- 
mit readable letters in a space the same as or 
smaller than that required when the letters 
are presented without a surround. The total 
vertical dimension of each of the three dis- 
plays, at threshold, can be obtained by multi- 
plying the threshold letter size in Table 1 by 
the corresponding field size ratio. The re-
sults are presented in Table 2. Even for the
smallest size of field used, which provided a
clearance above and below the letters only
equal to the stroke width, the threshold field
size is increased by about 25%, at both lu-
minance levels, as compared to the no-sur-
round condition.


Table 1

Threshold Letter Size, in Minutes of Visual Angle

Ratio of Field        Luminance Level (mL.)
to Letter Size        8.45          0.084

1.0                   [illegible]   8.93
1.4                   5.4           7.97
5.5                   [illegible]   7.29


Table 2

Size of Field (Vertical Dimension in Minutes of Arc) with Letters at Threshold

Ratio of Field        Luminance Level (mL.)
to Letter Size        8.45          0.084

1.0                   [illegible]   [illegible]
1.4                   [illegible]   [illegible]
5.5                   [illegible]   [illegible]
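
The arithmetic behind these comparisons can be sketched as follows, using only the 0.084 mL. thresholds that are legible in Table 1. The helper names are hypothetical, and the computed field sizes are derived here by the multiplication described in the text rather than copied from Table 2.

```python
# Threshold letter sizes (minutes of visual angle) at 0.084 mL., from Table 1.
THRESHOLDS = {1.0: 8.93, 1.4: 7.97, 5.5: 7.29}

def field_ratio(clearance):
    """Field-to-letter ratio for a clearance (in letter heights) above and below."""
    return 1.0 + 2.0 * clearance

def percent_change(new, old):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old

baseline_letter = THRESHOLDS[1.0]
baseline_field = 1.0 * baseline_letter
for ratio, letter in THRESHOLDS.items():
    field = ratio * letter  # total vertical size of the display at threshold
    print(f"ratio {ratio}: letter {letter:.2f}, field {field:.2f}, "
          f"letter change {percent_change(letter, baseline_letter):+.1f}%, "
          f"field change {percent_change(field, baseline_field):+.1f}%")
```

A clearance of one-fifth the letter height gives `field_ratio(0.2)`, the 1.4 field, and a clearance of 2.25 letter heights gives the 5.5 field.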


Discussion 


The effect of size of surround on various 
visual thresholds has been extensively investi- 
gated (1, 3, 4). In general, presentation of 
a threshold test object on a small surround 
results in poorer performance (higher thresh- 
olds) than with a larger surround. Although 
these phenomena are commonly formulated 
in terms of the area of the surround, Fry and 
Bartley have shown (2) that a critical factor 
in such findings is the proximity of a “border” 
(e.g., the transition from the illuminated sur- 
round to the dark background) to the thresh- 
old contours. Since visual acuity thresholds 
depend on the establishment of threshold 
gradients of excitation corresponding to the 
differences in intensity in the retinal image, 
it is not surprising to find that acuity thresh- 
olds are similarly depressed when the sur- 
round field is reduced and the border is closer 
to the test letters. 

Some part of the loss in acuity when the 
border was contiguous with the upper and 
lower edges of the letters might be attributed 
to modification and obliteration of some of 
the form and shape characteristics of the let- 
ters. This factor would presumably supple- 
ment the threshold depressing effects of the 
reduction in surround and proximity of the 
border. Actually, with the type of letters 
used in this experiment, there appeared to be 
little confusion introduced by this factor. 

Usual practice would dictate leaving a clear 
surround of considerable extent between let- 
tering and the edge of a drawing or between 
the letters and any border lines surrounding
the letters. Although the context of this ex-
periment is not closely comparable to such 
situations, it would appear that part of the 
justification of this practice can be found in 
actual improvement of threshold discrimina- 
bility of letters of a given size when a sur- 
rounding field is provided. More improve- 
ment would occur, however, if the available 
space were utilized by making the letters 
nearly as large as possible, since it was found 
that leaving a clear field only as large as 
the stroke width of the letters required 25% 
more space, over all, to provide letters of 
threshold size, and this under conditions
where the border was of high contrast, and 
therefore presumably having maximum effect 
on threshold. 

It is possible that tests made with some 
even narrower field would have resulted in 
field size at threshold equal to, or perhaps 
slightly smaller than, that obtained when the 
letters completely filled the field. However, 
when space limitations are a primary consid- 
eration, letters should be made as large as 
possible, at least to the point of very nearly 
filling the available space, in order to permit 
discrimination at a maximum distance. 

Some letters and other similar symbols 
might be especially adversely affected by hav- 
ing the edge of the field contiguous with the 
symbol. Consequently, it would scarcely be 
recommended to leave no discriminable field. 
Also it should be noted that other criteria of 
visibility, such as speed of recognition of let- 
ters or words of supraliminal size, might show 
more adverse effect of proximity of a high- 
contrast border. Although this experiment 
was carried out with a 20-foot viewing dis-
tance, there seems to be no reason to sup-
pose that the retinal mechanisms involved 
would not operate in a similar manner with 
shorter viewing distances, if the same cri- 
terion of visibility (acuity threshold) were 
employed. 




Summary 


Visual acuity determinations were made on 
two groups of 20 subjects each at two lumi- 
nance levels (8.45 and 0.084 mL.) with three 
conditions of surround, or field clearance, 
above and below the line of letters. Provid- 
ing a field equal to the stroke width of the 
letters gave improvement in mean acuity 
thresholds of nearly 11% over those obtained 
with no field, and the wider field (2.25 times 
the letter size) gave improvements of 18 to 
20%. These proportional increases were ap- 
proximately equal at both luminance levels. 

When the data are examined in terms of 
the over-all size of field required to provide 
threshold letters, however, it is found that 
the decrease in letter size is not enough to 
compensate for the additional space taken up 
by the field. It is concluded that, when 
space limitations are a consideration, letters 
should be made as large as possible up to the 
point of very nearly filling the available space 
(margin less than the stroke width of the 
letters), in order to permit discrimination at 
a maximum distance. 


Received January 25, 1956. 


References 


1. Fry, G. A. Effects of uniform and non-uniform 
surrounds on foveal vision. Amer. J. Optom. 
& Arch. Amer. Acad. Optom., 1950, 27, 423- 
436. 

2. Fry, G. A., & Bartley, S. H. The effect of one 
border in the visual field upon the threshold 
of another. Amer. J. Physiol., 1935, 112,
414-421. 

3. Ratoosh, P., & Graham, C. H. Areal effects in 
foveal brightness discrimination. J. exp. Psy- 
chol., 1951, 42, 367-375. 


4. Wald, G. Area and visual threshold. J. gen.
Physiol., 1938, 21, 269-287.

5. Handbook of human engineering data. Tufts
College, 1952.













Empirical Assessment of Handrail Diameters ¹


Norman B. Hall, Jr. 
Dunlap and Associates, Inc. 
and Edward M. Bennett 


Tufts University 


The present study ² considers one specifi-
cation of public stairways, the handrail di-
ameter. The American Standard Association
(1) reports, “The size rail shall be; where of
hard wood at least 2 inches in diameter;
where of metal pipe at least 1.5 inches in
diameter.”

One major insurance company (2) speci- 
fies, “Rail should be approximately 2 inch in 
width and should be either round or so 
shaped as to permit comfortable grasping.” 

The following reports an empirical assess- 
ment of these specifications for round hard- 
wood rail. Since the large majority of hand- 
rail users are women, and since the construc- 
tion of women’s shoes and skirts adds a safety 
hazard which must partially be offset by 
handrails, only women were used in the 
study. 

Fifty-one female clerical employees, vary-
ing in age from 20 to 60, were studied in
counterbalanced ascending and descending se- 
ries. They were questioned four times con- 
cerning the handrail, twice going up and 
twice coming down stairs. A double forced- 
choice method of questioning was used. 

The stairway used in this study was one 
flight of stairs between the third floor and 
the landing one-half flight down, in a mod- 
ern office building. The continuous hand- 
rail was removed and replaced by four ex- 
perimental sections of 1.5, 1.75, 2.00 and 
2.25 inches diameter. The four sections were 
of equal length and placed in decreasing 
diameter for descent (increasing diameter 
ascending). 

The subjects were instructed as follows
before ascending or descending, “Will you
please use the handrail while going down
(up) the stairs. When you have reached the
bottom (top) you will be asked some ques-
tions with regard to the handrail.”

¹ At the request of, and sponsored by, the Liberty
Mutual Insurance Company, Boston.

² One of a series of studies in the psychology of
safety.

Question 1, preference, asked, “Which sec- 
tion was most pleasing to use? Which was 
least pleasing? From the two remaining, 
which was the most pleasing? Which was 
the least pleasing?” 

Question 2, felt safety, asked, “Which sec- 
tion do you feel would have given you the 
most security if you had started to fall? 
Which the least security?” 

As a result we had four choices (weighted 
4, 3, 2, and 1) for two questions (preference 
and felt safety), for four rail diameters (1.50, 
1.75, 2.00, and 2.25 inches). 

Two analyses were considered. First, the 
distribution of diameters scored by the 51 
subjects as first choice (highest preferred and 
highest felt safety), for ascent and descent. 
These results are shown in Fig. 1. 

Second, the distribution of mean choice 
(preference and felt safety) scores for the 
various diameters, ascending and descending 
combined. Results are shown in Fig. 2. 
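
A minimal sketch of the two tabulations, assuming each response is a complete ranking of the four sections from most to least preferred. The three rankings below are invented for illustration and are not the study's responses; the function names are likewise hypothetical.

```python
from collections import Counter

WEIGHTS = (4, 3, 2, 1)                 # first choice through fourth choice
DIAMETERS = (1.50, 1.75, 2.00, 2.25)   # inches

def mean_choice_scores(rankings):
    """Mean weighted choice score for each diameter."""
    totals = {d: 0 for d in DIAMETERS}
    for ranking in rankings:
        for weight, diameter in zip(WEIGHTS, ranking):
            totals[diameter] += weight
    return {d: totals[d] / len(rankings) for d in DIAMETERS}

def first_choice_percentages(rankings):
    """Percentage of respondents naming each diameter as first choice."""
    counts = Counter(ranking[0] for ranking in rankings)
    return {d: 100.0 * counts[d] / len(rankings) for d in DIAMETERS}

# Hypothetical responses, each ordered from most to least pleasing.
rankings = [
    (2.00, 1.75, 1.50, 2.25),
    (1.75, 2.00, 2.25, 1.50),
    (2.00, 1.75, 2.25, 1.50),
]
scores = mean_choice_scores(rankings)
firsts = first_choice_percentages(rankings)
for d in DIAMETERS:
    print(f"{d:.2f} in.: mean score {scores[d]:.2f}, first choice {firsts[d]:.0f}%")
```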


[Figure 1: two panels, Preference and Felt Safety, showing per cent of cases (vertical axis) against handrail size (1.50, 1.75, 2.00, and 2.25 inches).]

Fig. 1. Percentage of cases choosing various handrail diameters as first choice.

[Figure 2: mean choice intensity (vertical axis) against handrail diameter.]

Fig. 2. Mean choice intensity as a function of handrail diameter.


Based upon these findings the following 
conclusions were drawn: 

1. Diameters of 1.75 and 2.00 inches are 
about equally preferred in descent. The di- 
ameter of 2.00 inches is most preferred for 
ascent. 

2. The diameter of 2.00 inches gives the 
greatest feeling of safety both in ascent and
descent. 

3. In terms of over-all preference a di-
ameter of 1.90 inches is suggested as ideal. Any
deviation from this ideal should be upward,
but not to exceed 2.00 inches.

4. In terms of felt safety a diameter of
2.00 inches is suggested.

The distributions of first choices for each
of the rail diameters were significantly dif-
ferent from those to be expected on the basis 
of chance. When Question 1 was used, the 
distribution resulted in a chi square of 102, 
significant at beyond the .01 level. When 
Question 2 was used, the chi square was 122, 
also significant at beyond the .01 level. 

There is also a small but real difference 
in the distributions obtained depending upon
which of the two questions is used. When a 
comparison is made between questions, the 
chi square is 9.36, which is slightly beyond 
the .05 level of confidence. 
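
For readers who wish to check chi-square values of this kind, the sketch below computes a goodness-of-fit statistic and its upper-tail probability. It assumes 3 degrees of freedom (four diameters); the article does not state the degrees of freedom, so this is an assumption, and the observed counts are invented for illustration.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for observed against expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_df3(x):
    """Upper-tail probability of chi-square with 3 degrees of freedom (assumed)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Hypothetical first-choice counts for the four diameters (51 respondents),
# tested against equal expected frequencies.
observed = [5, 15, 25, 6]
expected = [sum(observed) / 4.0] * 4
statistic = chi_square_statistic(observed, expected)
print(f"chi square {statistic:.2f}, p {chi_square_sf_df3(statistic):.4f}")
print(f"p for the reported 9.36 under df = 3: {chi_square_sf_df3(9.36):.4f}")
```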


Summary 


An experimental study of public stairway 
handrail diameters suggests that a diameter 
between 1.75 and 2.00 inches is to be preferred. A
diameter of 2.00 inches feels the most safe. Di-
ameters above 2.00 and below 1.75 inches are to be
avoided.


Received November 25, 1955. 


References 


1. American Standard Safety Code for floor and 
wall openings, railings and toe boards. A 
12-1932. 

2. Liberty Mutual Insurance Company (Loss Pre- 
vention Dept.) Specification for Public Stair- 
way Safety and Production Data Sheet No. 
87. 










Personal History Data as a Predictor of Success in Service
Station Management ¹


Robert S. Soar 


Vanderbilt University 


Selecting the dealer to be given charge of a 
service station represents a serious problem 
to the oil company, yet one which remains 
little explored by scientific selection pro- 
cedures. The new dealer, after a period of 
training, is given responsibility for manage- 
ment of a physical plant and inventory in 
which the oil company has a large financial 
and good-will investment. If he fails, it re- 
quires several months for this to become defi- 
nite, and a twofold loss has occurred mean- 
while. 

In past studies of sales personnel, the two 
procedures that have been most consistently 
useful for predicting success in a variety of 
situations have been the personal history 
blank, objectively scored, and the Strong 
Vocational Interest Blank. Since the social 
and educational level of these dealers is such 
as to make crystallized patterns of interest 
less frequent among them than among the 
largely professional group on which the Strong 
blank was standardized, developing a scoring 
scheme for personal history items was chosen 
as the initial approach to the problem. 


Procedure 


The subjects were 29 dealers currently op- 
erating service stations in a metropolitan area 
of about 300,000 population, in a Southeast- 
ern state. The criterion data consisted of 
ratings by the dealers’ supervisors. Although 
an objective criterion, such as gallonage or 
dollar sales, would be preferable on many 
counts, it was felt to be inappropriate here 
because of the strong contaminating influ- 
ence of differences in location. On the other 
hand, ratings were probably sounder in this 
case than they frequently are. Five men su-
pervised dealers in the area, with territories
systematically rotated so that all raters knew
the work of each dealer, but personal attach-
ments were less likely to have been built up
strongly than would otherwise be the case.
Ratings were also made by one man at the
next higher level of management, so that all
dealers were rated six times. In addition,
since all the raters shared the problem of
selecting and training new dealers, all were
concerned with development of selection pro-
cedures and motivated to rate carefully.

¹ This study was made possible by Mr. Gilbert B.
Dickey, Jr., President of Trans American Oil Com-
pany (formerly American Oil Company of Tennes-
see).

The aspects of service station management 
on which the dealers were rated were those 
developed by discussion of the management 
staff as the elements they felt to be impor- 
tant. Altogether, 15 aspects of performance 
were rated for each man. It seemed impor- 
tant to rate various aspects of performance 
separately, since if some elements were inde- 
pendent of others, a criterion composite would 
be less meaningful than the independent meas- 
ures. However, in the light of the small 
number of subjects and the overlapping be- 
tween items it seemed impractical to analyze 
all the intercorrelations of the 15 criterion 
ratings, so these were grouped into clusters 
that seemed a priori to be related, and the 
intercorrelations between the clusters were 
calculated. The clusters were these: Busi- 
ness Sense, involving management and financ- 
ing; Promotion, involving merchandising and 
enthusiasm in selling; Emotional Maturity, 
involving stability and emotional control; Re- 
sponsibility, involving loyalty and willingness 
to assume self direction; and Personality, in- 
volving appearance and ability to inspire 
liking and confidence. The intercorrelations 
of these ratings are shown in Table 1. 

Table 1

Intercorrelations * Between Aspects of Rated Success in Service Station Management

                           2
1. Business Sense          67
2. Promotion
3. Emotional Maturity
4. Responsibility
5. Personality

[Remaining entries illegible.]

* SE = ±.19; P < .001 for all intercorrelations.

These intercorrelations were high enough
to make analysis of the data for each aspect
separately seem not worthwhile; accordingly,
the ratings on various sub-aspects of perform-
ance were totaled and this single rating used
as the measure of success in service station
management. The average interrater reli-
ability, by way of Kendall’s coefficient of 
concordance (1), was + .80, significant be- 
yond the .001 level. Since the performance 
of individual dealers is likely to be a subject 
for discussion at sales meetings, this figure is 
undoubtedly inflated by what Thorndike (5) 
calls “local reputation,” but on the other 
hand these discussions are directed at capi- 
talizing on the strong points and dealing with 
the weak points of each dealer, so that they 
may also have contributed to the validity of 
the ratings. 
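
As an illustration of the statistic named above, the sketch below computes Kendall's coefficient of concordance W for m raters and n subjects, together with the average rank correlation among raters implied by W. It uses the standard formulas without a correction for tied ranks; the function names and the tiny example are assumptions, and this is not a reconstruction of the study's own computation.

```python
def average_ranks(scores):
    """Ranks from 1 (lowest score) upward, with tied scores sharing the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W; ratings is a list of m lists, each holding n scores."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_total = m * (n + 1) / 2.0
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def mean_rank_correlation(w, m):
    """Average rank correlation among the m raters implied by W."""
    return (m * w - 1.0) / (m - 1.0)

# Hypothetical ratings: three raters, four dealers.
ratings = [[7, 5, 9, 3], [6, 5, 8, 2], [7, 6, 9, 4]]
w = kendalls_w(ratings)
print(f"W = {w:.2f}, mean rank correlation = {mean_rank_correlation(w, len(ratings)):.2f}")
```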

Since the company did not keep application 
blanks on file for employees, it was necessary 
for the personal history data to be collected 
from the dealers in terms of their status at 
the time they accepted dealerships. This in- 
troduces the possibility of memory error and 
distortion, although Keating, Paterson, and 
Stone (4) have indicated that neither mem- 
ory error nor distortion is as prevalent as 
might be expected, and in fact scarcely ex- 
isted in their sample. It should be noted 
that their subjects were job applicants so 
that some pressure to inflate past responsi- 
bilities and wages might be expected, whereas 
such pressure should be much less prevalent 
here, if it existed at all. 

Since no statistic was likely to give signifi- 
cant results on such small numbers unless the 
differences were extreme, the item analysis 
was carried out on a different basis than is 
usually employed, but one whose rationale is 
an extension of (or perhaps more accurately, 
extrapolation from) current item-selection 
theory. Katzell (3) has pointed out that 
requiring stringent significance levels in item 
selection is likely to result in the selection of
items in which chance positive variance is
unusually large so that cross validation re- 
sults in extensive validity shrinkage. With 
large numbers of cases, the problem is less 
serious, but with smaller numbers (usually 
300 to 500 are mentioned) the problem be- 
comes serious. Katzell’s approach to solving 
the problem is to require less extreme signifi- 
cance levels as the number of cases becomes 
smaller, and to use a variant of double cross 
validation in which the sample is split in 
half, and only those items are retained which 
reach a relatively liberal level of significance 
in both halves of the sample. 

The procedure used here was an extrapola- 
tion from this latter procedure. The personal 
history blanks were put in rank order by 
total performance rating, and assigned alter- 
nately to two groups. Each group was then 
split into a high and low half, and the item 
analysis carried through separately on each 
group. Items were then retained or thrown 
out on the basis of whether they discrimi- 
nated in the same direction between high and 
low halves of each group, with no considera- 
tion given to whether the discrimination took 
place at a significant level. 
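
The item-retention rule described above might be sketched as follows. The data layout (a dict per blank with a `rating` and 0/1 `items`), the function names, and the handling of an odd-sized group (the middle case is left out of both halves) are assumptions made for illustration; the eight blanks are invented.

```python
def split_alternately(blanks):
    """Rank blanks by total rating and deal them alternately into two groups."""
    ranked = sorted(blanks, key=lambda b: b["rating"], reverse=True)
    return ranked[0::2], ranked[1::2]

def direction(group, item):
    """+1 if the high half endorses the item more often, -1 if less, 0 if equal."""
    ranked = sorted(group, key=lambda b: b["rating"], reverse=True)
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[-half:]
    diff = sum(b["items"][item] for b in high) - sum(b["items"][item] for b in low)
    return (diff > 0) - (diff < 0)

def retained_items(blanks, item_names):
    """Keep items that discriminate in the same direction in both groups."""
    group_a, group_b = split_alternately(blanks)
    kept = {}
    for item in item_names:
        dir_a, dir_b = direction(group_a, item), direction(group_b, item)
        if dir_a != 0 and dir_a == dir_b:
            kept[item] = dir_a   # no significance test is applied
    return kept

# Hypothetical blanks: item "x" tracks the rating, item "y" does not.
blanks = [
    {"rating": r, "items": {"x": int(r >= 5), "y": int(r in (8, 6, 3, 1))}}
    for r in range(8, 0, -1)
]
kept = retained_items(blanks, ["x", "y"])
print(kept)
```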


Results 


Of the 39 items of personal history col- 
lected, 14 were retained on the basis of this 
analysis to make up the scoring key. They 
were weighted either 2 or 1, depending on 
the degree of differentiation shown in the data 
for the total group. The items were as fol- 
lows—weighted 2: over 5'63” in height, no 
more than 200 lbs. in weight, between 25
and 39 years of age, held a blue-collar job 
while in high school (jobs involving long 
hours, outside work, and/or unpleasant work- 
ing conditions) including farming, no more 
than one child; weighted 1: two or more sub- 
jects listed as liked in school, two or more sub- 
jects listed as liked least in school, held office 
in high school organization, held job in high 
school involving some aspects of white- and 
blue-collar work (meat cutter, baker), wife 
not working, own home or paying on it, owe 
money, carry other insurance in addition to 
life insurance, have $500 or more in savings, 
work on own car.
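
Scoring a blank with such a key amounts to summing the weights of the items a respondent meets. The sketch below uses only a few of the listed items, with hypothetical identifiers, to show the arithmetic; it is not the complete key.

```python
# A subset of the key for illustration; item identifiers are hypothetical.
SCORING_KEY = {
    "age_25_to_39": 2,
    "no_more_than_one_child": 2,
    "held_office_in_high_school": 1,
    "works_on_own_car": 1,
}

def personal_history_score(responses, key=SCORING_KEY):
    """Sum the weights of keyed items the respondent answers in the keyed way."""
    return sum(weight for item, weight in key.items() if responses.get(item, False))

respondent = {"age_25_to_39": True, "works_on_own_car": True}
print(personal_history_score(respondent))
```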










Further data were then collected on an 
additional 23 subjects from two other, some- 
what smaller metropolitan areas in the same 
state. In addition to the 14 items retained 
in the scoring key on the basis of the first 
analysis, eight others were again tried out 
which had discriminated in one group, but 
had shown equal frequencies of responses 
from both high and low halves in the other 
group. 

In the cross-validation sample the ratings 
which were obtained were categorizations of 
the subjects into three degrees of success. 
The triserial correlation (2) of personal his- 
tory scores (based on 14 items) with rated 
success was + .47 (p< .05). The additional 
questionable items which were then reana- 
lyzed were found not to be discriminating in 
the new sample. 


Summary and Conclusions 


Ratings on 15 aspects of service station 
management were collected for 29 dealers in 
one metropolitan area. Intercorrelations of 
these ratings showed a single over-all rating 
to be appropriate. This over-all rating was 
then used as the criterion against which a
personal history blank was item analyzed by
a variant of double cross validation. The
items retained in the scoring key were then
cross validated on a new sample of 23 dealers 
drawn from two other cities. 

These conclusions were drawn: 


1. A unitary criterion was adequate to de- 
scribe performance in service station man- 
agement. 

2. Of the 39 items studied, 14 were found 
to discriminate more successful dealers from 
less successful, and to retain validity with 
cross validation. 

3. An item analysis procedure based on 
double cross validation was found to be suc- 
cessful with a sample much smaller than usu- 
ally considered adequate for item analysis. 


Received March 5, 1956. 


References 


1. Edwards, A. L. Statistical methods for the be-
havioral sciences. New York: Rinehart, 1954.

2. Jaspen, N. Serial correlation. Psychometrika,
1946, 11, 23-34.

3. Katzell, R. A. Cross-validation of item analysis.
Educ. psychol. Measmt, 1951, 11, 16-22.

4. Keating, Elizabeth A., Paterson, D. G., & Stone,
C. H. Validity of work histories obtained by
interview. J. appl. Psychol., 1950, 34, 6-11.

5. Thorndike, R. L. Personnel selection. New
York: Wiley, 1949.







Interest Scores in Identifying the Potential Trade School Dropout 


Cecil O. Samuelson and David T. Pearson, Sr. 


The records of the Salt Lake Area Voca- 
tional School reveal that a considerable num- 
ber of its students drop out of school before 
completing a program of study. This is not 
a unique situation among post high school in-
stitutions, but it does pose the serious ques-
tion as to why these dropouts occur. This
modest inquiry was not intended to furnish 
an answer to this total question but was an 
effort to explore the facet of this problem 
that relates to differential interest patterns, 
if any, among trade school students. 

While it is recognized that the factors that 
relate to dropouts are many and varied, it 
was thought that an interest inventory might 
reflect these factors. Specifically, the intent 
was to determine whether the group that 
stayed in school to complete a course could 
be distinguished from the group that dropped 
out before completion on the basis of pat- 
terns of interest as measured by the Kuder 
Preference Record, Form CH. 

Inasmuch as all the training departments 
included in this study were basically mechani- 
cal in nature, it was hypothesized that stu- 
dents successful in the sense of completing 
the course would generally score high in the 
mechanical section of the inventory, while 
the interest patterns of those who dropped 
out of school before completion would show 
a different sort of profile. Since the Kuder 
Preference Record was available on the stu- 
dents in this school, it was decided to investi- 
gate the problem through the use of this in- 
strument. 


The Sample 


The sample on which this study was made 
consisted of two groups of students—55 who 
dropped out before completion and 48 who 
completed courses; these were the total num- 
ber in both categories on whom Kuder Pref- 
erence Record Scores were available for the 
period covered by the study. All of those in 
both groups were from the auto mechanics, 
auto body and fender, diesel mechanics, carpentry,
electricity, electronics, machine shop,
and welding departments of the school. The
first group of 55 consisted of those who had
dropped out of the trade training programs
at the Salt Lake Area Vocational School during
the academic years 1953-54 and 1954-55.
Those who dropped out to take employment 
or enter apprenticeship programs in the field 
for which they were training were not in- 
cluded in this group. 

The second group of 48 consisted of those 
who either completed their programs during 
this period or who dropped out to enter em- 
ployment or apprenticeship training in the 
same field for which their vocational school 
programs were preparing them; 34 completed 
their training while 14 terminated formal 
school training to enter employment training 
situations in the same field. 

Both groups were composed entirely of 
males, and, although no effort was made to 
pair the groups, they are roughly similar as 
concerns age, marital status, and amount of 
education. 


Limitations 


Of the various limitations to a study of this 
sort, attention will be drawn to only the more 
obvious. First, there is the limitation of the 
instrument itself; it is assumed that this point 
needs no elaboration here. A second limita- 
tion concerns the sample. The school does 
not require tests as a part of the admissions 
procedure but does give a test battery peri- 
odically which all students must take before 
they are considered to be fully registered. In 
this particular institution students may enter 
at any time, so the procedure is to test in one 
group all those who have entered or applied 
for admission since the last administration of 
the test battery. This means that in some 
instances students may have been in school 
some time before they were tested; most stu- 
dents, of course, would be tested before or 
as soon as they entered school. The actual 
situation in this study is that 44% of the 
total group took the tests on or before the 
day they entered school. Nine per cent took 




the tests within the next ten days after en- 
tering school; 8% took the tests within the 
second ten-day period; while 36% took the 
tests within the third ten-day period after 
registration. Four students took the tests 
after they had been in school more than one 
month. 

One facet of this point is that it is not 
known what effect these variable amounts of 
training may have had on these interest 
scores. A second facet in this connection is 
that some students enter school and drop out 
before they have been tested; this group is 
very small and was not included in the study, 
but this does represent a bias in sampling 
which should be noted. 

Also, it must be recognized that both of 
these groups may already be somewhat se- 
lected in favor of an interest in mechanics. 
Presumably, few people would enroll in a 
mechanical course without at least some feel- 
ing that they would like to do that type of 
work. Furthermore, some of these students 
have had work experience before coming to 
school and consequently have some first-hand 
experience to support their decisions to enter 
vocational school in the first place. It would 
be expected that those whose felt inclinations 
were not mechanical would not have enrolled 
in these departments in the first place. 


Discussion 


The V scores on all of the individual inven- 
tories included in this study were within the 
limits prescribed in the published instruc- 
tions. Presumably, then, these students un- 
derstood the directions, had sufficient intelli- 
gence to comprehend the inventory items, 
and actually marked the answer sheets in an 
acceptable manner. 

To facilitate consideration of the inventory 
scores, the differences between the means of 
the two groups were computed in each of the 
inventory categories. Also, composite pro- 
files were made for each of the two groups 
separately which were then plotted on the 
same profile sheet. From this vantage point 
the following observations were made: 

First, it was apparent that there were no
great differences between the two groups in
any of the test categories. While there was
some variability, of course, none of the differences
between the means, with the single
exception of category 8, Social Service, was
significant; and this single exception was significant
at only the .05 level.

Fig. 1. Composite profiles of those who completed
trade courses and those who dropped out before
completion. (Figure not reproduced. Legible profile
categories: Mechanical, Computational, Scientific,
Persuasive, Artistic, Literary, Social Service, Clerical.)
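The comparison of group means can be sketched as follows. The article does not say which significance test was used, so a pooled-variance two-sample t test is assumed here, and the scores in the example are invented for illustration; they are not data from the study.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical raw scores on one inventory category (not data from the study).
completions = [1, 2, 3, 4, 5]
dropouts = [2, 3, 4, 5, 6]
t_value, df = pooled_t(completions, dropouts)
print(f"t = {t_value:.2f} on {df} degrees of freedom")
```

With the actual groups of 48 and 55 the statistic would be referred to a t table with 101 degrees of freedom, once for each inventory category.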

A second observation was the relatively flat 
profiles shown by both groups, as indicated 
in Fig. 1. This would tend to suggest the 
idea that the interests of trade school stu- 
dents as reflected by this inventory are broad 
and diverse rather than being highly crystal- 
lized in the mechanical area. Also, the dif- 
ferences, slight as they were, did not suggest 
distinctive patterns. While we cannot be cer- 
tain that the 75th percentile, suggested by 
Kuder as representing a point of significant 
interest intensity, has meaning in this particular
connection, even by that standard the
profiles lack points of distinctive strength. 
It will be noted that the single exception 
from this point of reference was the me- 
chanical score for the group that completed 
courses, and even that score was just barely 
in the so-called significant area—the 77th 
percentile. It might have been expected that 
the mechanical category would be the point 
of greatest interest for the completions, but 
it would also have been expected that the 







strength shown in this area would be quite 
decisive rather than the fairly mild sugges- 
tion indicated. 

Still using Kuder’s 75th percentile as a 
point of reference, it may further be noted 
that while the combining of scores is useful 
in considering total samples such as these, 
the process does obliterate the observable 
variability in individual scores. For example, 
using the mechanical category on the inven- 
tory with the group that completed courses, 
seven of the 48 had scores that would place 
them below the 50th percentile, while only 
27 of the 48 were above the 75th percentile. 
The extreme caution with which these scores 
must be used becomes apparent when it is 
remembered that all of these 48 students 
successfully completed courses in these areas 
of mechanics; yet 44% of them had me- 
chanical interest scores below the recom- 
mended 75th percentile, while about 15% of 
them had scores below the 50th percentile. 
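The percentages quoted above follow directly from the counts given in the text. The short check below assumes that no student scored exactly at the 75th percentile, so that the 21 students not above it are the ones described as below it.

```python
completions = 48   # students who completed courses
above_75th = 27    # mechanical scores above the 75th percentile
below_50th = 7     # mechanical scores below the 50th percentile

pct_not_above_75th = 100 * (completions - above_75th) / completions
pct_below_50th = 100 * below_50th / completions
print(f"{pct_not_above_75th:.0f}% not above the 75th percentile")
print(f"{pct_below_50th:.0f}% below the 50th percentile")
```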

Attention should also be drawn to yet another
pertinent observation. It has been
previously noted that the 48 in the one group
consisted of 34 who had completed their
courses in school and 14 who had dropped 
out of school before graduation to enter em- 
ployment training in situations for which 
their school work had been preparing them. 
It was thought that these 14 might possibly 
be a separate group in terms of interests and 
the data were so analyzed. However, no sig- 
nificant differences were revealed. 


Conclusions 


Consideration of the data presented here 
would seem to support these conclusions: 

1. The present use of the Kuder Preference 
Record in these trade training departments of 
this vocational school is of very limited value 
in helping students evaluate their decisions to 
become mechanics. 

2. On the basis of this study and within 
its limitations, the Kuder is not helpful in 
distinguishing between those who will com- 
plete mechanical courses and those who will 
not. 


Received December 27, 1955. 








Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


The Naval Knowledge Test 


Albert S. Glickman 


U.S. Naval Personnel Research Field Activity, Washington¹


The Naval Knowledge Test (NKT) arose 
from a need to measure motivation to excel 
at the Navy’s Officer Candidate School. It 
assumes that, other factors being constant, 
one who has accumulated more naval knowl- 
edge prior to enrollment at OCS, is more in- 
terested in naval matters, and will conse- 
quently find it easier to meet the work 
demands of OCS, will be less distracted by 
petty annoyances, and be willing and able to 
devote more energy to the serious pursuit of 
achievement at OCS. 


Description 


The first experimental form of the NKT 
consisted of 136 items arranged under the 
following headings: 

1. Identification or definition of names,
words, phrases, slang, and symbols (53
items);

2. Information about ships that have been
prominent in naval history (19 items);

3. Events and locations that have been
prominent in naval history (20 items);

4. Prominent naval personalities, past and
present (17 items);

5. Knowledge of naval organization and
practices (27 items).

These sections were not considered as sub- 
tests but only as organizational clarifiers. 

Administration time was 45 minutes— 
enough for practically all subjects to com- 
plete the test. 


Construction 


In writing items for the test the author ad- 
hered to requirements that: 

1. The subject matter of the items should
bear upon matters pertaining to the Navy,
past and present, and to the seas and ships
in general.

2. The information or knowledge required
to answer the questions correctly should not
be of a highly technical nature, but should
be of the sort fairly readily available in the
“public domain,” so that acquisition of such
knowledge and information might be considered
to represent a manifestation of interest
and motivation rather than special opportunity
or training of a naval or seafaring sort.

¹ This research was conducted while the author
was with the American Institute for Research, working
under contract Nonr 890(01) with the Office of
Naval Research. This article draws from contractor’s
reports on NKT construction and validation
prepared as Bureau of Naval Personnel Technical
Bulletins (2, 3). The viewpoints expressed herein
are not to be construed as those of the U. S. Navy.

These steps were followed in the writing of 
the items: First, the author developed ideas 
for, and wrote items on, a “free association” 
basis. That is to say, he drew from his own 
store of “naval knowledge” for material for 
items. Second, additional material for items 
was sought in various naval histories, in 
standard naval and maritime texts and hand- 
books, and in the Bureau of Naval Personnel 
magazine, All Hands. Third, reference was 
made to the file of Naval Knowledge items 
used in subject matter tests by the Orienta- 
tion Section of the Officer Candidate School 
as another source of ideas. 

After all the items had been put in the 
form of five alternative, multiple-choice ques- 
tions, they were submitted to several naval 
officers for comment or correction, to insure 
that they were in conformity with fact and 
naval usage. 

In order to promote clarity and facilitate 
instructions to the subjects the test items 
were sorted into the five classifications indi- 
cated above. 


Validation 


Population. The test was administered to 
three companies of Class 13 at the Officer 
Candidate School in the first week at the 
school, during which period the students are 
processed, tested, and organized for adminis- 
trative purposes. They do not attend any 
classes during this week, and so are only incidentally
exposed to “naval knowledge” before
taking the test. The numbers of cases
were as follows: Company E, 165; Company
F, 161; and Company G, 133.


Table 1

Correlations of Naval Knowledge Test Scores with
OCS First Quarter Academic Sum

Class 13        N

Company E      165
Company F      161
Company G      133

Criterion. The criterion employed in this 
investigation consisted of the sum of quar- 
terly grades (covering the first month of 
training) for the six major areas of academic 
training at OCS. This was considered to be 
a practical criterion in that about one-half of 
the cases of disenrollment at OCS take place 
at the end of the first quarter, action being 
based in largest part upon academic perform- 
ance during this period. 

The validity of the test. Correlations were 
obtained between NKT scores and the cri- 
terion. These were computed separately for 
each company. These correlations are found 
in Table 1. 

The results, to this point, indicated that 
the coefficient of validity for the NKT, in 
terms of predicting early academic achieve- 
ment at OCS was of sufficient magnitude to 
warrant further study. 

Multiple prediction of criterion. In prac- 
tical terms, the utility of the NKT as a se- 
lection instrument is dependent upon the de- 
gree to which it taps something other than 
that which is assessed by the Officer Qualification
Test (OQT), which is the principal
test employed in the screening of OCS applicants
from civilian life by the Office of
Naval Officer Procurement (ONOP).²

In order to establish whether the NKT
could materially improve selection (as demonstrated
by prediction of performance at
OCS), multiple correlations were computed
for all cases in our sample for whom OQT
scores were available.³

Using First Quarter Academic Sum as the
criterion, intercorrelations were obtained as
indicated in Table 2.⁴

It appeared from the differences between 
the zero-order correlations of OQT with Aca- 
demic Sum and the multiple-correlation re- 
sulting from the addition of the NKT score 
to the battery (Table 2, Column E), that the 
increase in predictive efficiency was substan- 
tial. 
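The multiple correlation for two predictors can be computed from the three zero-order correlations. The sketch below uses the standard two-predictor formula, which the article does not spell out, applied to the Company F figures of Class 13 as read from Table 2; the function and variable names are illustrative only.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors."""
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


# Company F of Class 13, as read from Table 2.
nkt_vs_academic = 0.377   # Column A
nkt_vs_oqt = 0.284        # Column B
oqt_vs_academic = 0.414   # Column C

r_multiple = multiple_r(nkt_vs_academic, oqt_vs_academic, nkt_vs_oqt)
increment = r_multiple - oqt_vs_academic
print(f"R = {r_multiple:.3f}, increment over OQT alone = {increment:.3f}")
```

The result agrees to three decimals with the tabled multiple R of .495 and the Column E difference of .081 for that company.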


These findings indicated that the NKT
might provide an instrument for improving
efficiency of screening OCS applicants.⁵
Hence, steps were taken to refine the test
and to investigate the validity of the revised
form.

² It developed that Class 13 was drawn almost entirely
from nonfleet sources. Less than 1% of the
membership of this class had been drawn from the
Navy’s enlisted personnel. (Fleet personnel often
constitute as much as one-third of a class.) Thus
the results are interpretable as applicable to a group
having no previous experience in the Navy.

³ For the student populations reported here, a
minimum Navy Standard Score (NSS) of 40 had
been one of the prerequisites for acceptance as an
Officer Candidate. (OQT mean NSS = 50, standard
deviation = 10; based on a population of U. S. college
graduate applicants to ONOP [1].)

⁴ Attrition in the N from the original sample consists
of cases of personnel with previous Navy experience
(who do not take the OQT) and others
who, for various reasons, did not have the OQT
scores in their records.


Table 2

Intercorrelations Between Criterion (First-Quarter Academic Sum), and Predictors (Naval Knowledge Test
and Officer Qualification Test), and Multiple-Correlation Coefficients *

                Col. A      Col. B    Col. C      Col. D      Col. E
                NKT         NKT       OQT                     Difference
                vs.         vs.       vs.         Multiple    Between
Class 13        Academic    OQT       Academic    R           Cols. C and D

Company E       .527        .408      .627        .694        .057
Company F       .377        .284      .414        .495        .081
Company G       .328                  .499        .543        .044

Average rs**    .414        .312      .519        .584        .065

* See footnote 4.
** Obtained by Fisher's r to z transformation.


Table 3

Intercorrelations Between Criterion (First-Quarter Academic Sum), and Predictors (Naval Knowledge Test
and Officer Qualification Test), and Multiple-Correlation Coefficients

                                 Col. A      Col. B    Col. C      Col. D      Col. E
                                 NKT         NKT       OQT                     Difference
                                 vs.         vs.       vs.         Multiple    Between
Class 15                  N      Academic    OQT       Academic    R           Cols. C and D

Company A                133     .349        .197      .548        .601        .053
Company B                134     .528        .352      .546        .653        .107
Company C                141     .391        .293      .566        .613        .057

Class 15 average rs              .426        .282      .553        .623        .070
Class 13 average rs              .414        .312      .519        .584        .065
Average r differences,
  Class 15 - Class 13            .012       -.030      .034        .039        .005


Construction of a Short Form of the NKT 


As a first step toward increasing the effi- 
ciency of the NKT as a potential part of an 
OC selection battery, an item analysis was 
performed. The aim was to select for a 
shortened test those items having highest cor- 
relation with the criterion (Academic Sum), 
and which demonstrated measurement of fac- 
tors other than those already measured by 
the OQT by having no greater correlation 
with the OQT than with the Academic Sum. 
To achieve this, each item was separately 
correlated with OQT score and with Aca- 
demic Sum. Item difficulty estimates were 
also computed. On the average, items were 
correctly answered by 59.6% of the sample.⁶

Items were chosen for a short form which:
(a) had correlations between item and Academic
Sum of .10 or greater, and (b) correlated
as highly with Academic Sum as with
OQT, or more highly with Academic Sum
than with OQT. These requirements were
met by just over half (69) of the items. On
the average, 59.9% of the sample gave the
right answer to these items. Each of the five
different classifications of items of the original
form contributed about the same proportion
of items to the 69-item key.

⁵ Parallel results could be anticipated using Second
Quarter (mid-course) grades, inasmuch as the range
of intercorrelations of First and Second Quarter
grades for students still enrolled at the latter time
over seven companies of Class 13 had been found
to be .95 to .97 (not corrected for attrition in number
of cases and restriction in range due to disenrollments
from the fourth to the eighth week).

⁶ Detailed item-analysis results are reported in (2).
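The two selection rules for the short form can be sketched as a simple filter. The item labels and correlations below are hypothetical and serve only to show how rules (a) and (b) combine; the actual item statistics are in the technical bulletin cited in the text.

```python
def select_items(item_stats, min_validity=0.10):
    """Keep items meeting both short-form rules.

    item_stats maps an item label to (r with Academic Sum, r with OQT).
    """
    kept = []
    for label, (r_academic, r_oqt) in item_stats.items():
        # Rule (a): item-criterion correlation of .10 or greater.
        # Rule (b): correlation with Academic Sum at least as high as with OQT.
        if r_academic >= min_validity and r_academic >= r_oqt:
            kept.append(label)
    return kept


# Hypothetical item correlations, invented for illustration only.
example = {
    "item_1": (0.22, 0.10),   # passes both rules
    "item_2": (0.08, 0.02),   # fails rule (a)
    "item_3": (0.15, 0.30),   # fails rule (b)
    "item_4": (0.12, 0.12),   # passes: equal correlations are allowed
}
print(select_items(example))
```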


Generalized Validity of the 69-Item Key 
for the NKT 


Since it appeared that, at least in the case 
of civilian applicants, use of the NKT as a 
supplement to the OQT held considerable 
promise for improving the selection of officer 
candidates for the Navy, the original form 
of the NKT was used for follow-up studies 
at the Officer Candidate School on a new 
class (Class 15), and scored using the 69- 
item key. 

Population. As in the original validation 
the test was administered in the first week of 
school. Data analysis was restricted to the 
population of students (with OQTs) procured 
from civilian sources. Three companies of 
Class 15 supplied the following numbers of 
cases: Company A, 133; Company B, 134;
and Company C, 141.

Criterion. The criterion used was the same 
as for the earlier analysis, First Quarter Aca- 
demic Sum. 

Validity of the 69-item key. When the key 
of 69 items derived by analysis of Class 13 
responses was applied to the three companies 
of Class 15 the correlations listed in Column 
A of Table 3 were found. Reference to the
average rs⁷ found in the two classes shows
that the 69-item key used for the scoring of
Class 15 does as good a job of prediction as
the full 136-item test did for Class 13.
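The averaging of correlations by the z transformation can be illustrated with the Class 15 values in Column A of Table 3. An unweighted mean of the three z values is assumed here, since the article does not say whether the companies were weighted by size.

```python
import math


def average_r(correlations):
    """Average correlations through Fisher's r to z transformation (unweighted)."""
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


# NKT vs. Academic Sum for Companies A, B, and C of Class 15 (Table 3, Column A).
class_15 = [0.349, 0.528, 0.391]
avg = average_r(class_15)
print(f"average r = {avg:.3f}")
```

Under that assumption the result agrees with the tabled Class 15 average of .426.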

Multiple prediction of criterion. Once 
again multiple correlations were run, using 
the OQT and the NKT to predict academic 
grades. The pattern of results from Class 15 
data corroborates the Class 13 findings, as 
can be seen by further inspection of Table 3. 
It may be noted (Column C) that the OQT 
validity coefficients are a bit higher for the 
Class 15 sample (.034 on the average) and 
that the multiple R is higher (.039) by much 
the same amount (Column D), while the correlation
between NKT and OQT (Column
B) is slightly less than before (-.030).
The corresponding values (in Column E) 
show that the 69 items of the NKT continue 
to add an increment of .070 to the validity 
coefficient obtained by use of the OQT alone. 
Therefore, the percentage of the criterion 
variance accounted for increases from about 
31% to 39%, or 8%. This compares with 
the .065 validity increment contributed by 
the whole test in the Class 13 analysis. 
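The variance figures in this paragraph are the squares of the average Class 15 coefficients in Table 3, as the short check below shows.

```python
oqt_alone = 0.553      # average OQT validity, Class 15 (Table 3, Column C)
oqt_plus_nkt = 0.623   # average multiple R, Class 15 (Table 3, Column D)

variance_oqt = 100 * oqt_alone**2
variance_both = 100 * oqt_plus_nkt**2
gain = variance_both - variance_oqt
print(f"OQT alone: {variance_oqt:.0f}%")
print(f"OQT plus NKT: {variance_both:.0f}%")
print(f"gain: {gain:.0f} percentage points")
```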


Discussion 


On the basis of experience with the OCS 
samples reported upon here, it appears that 
a form of the NKT of about 70 items, requiring
about 25 minutes administration time,
can contribute a substantial increment of
predictive efficiency to the screening of civilian
applicants for the Navy’s Officer Candidate
School, beyond that obtainable through
the use of the OQT as a single selection instrument.⁸

Summary 


The Naval Knowledge Test rests on the 
assumption that the civilian who accumulates 
more knowledge about the Navy and maritime
matters in general before applying for
the Navy’s Officer Candidate School is the 
person who will be more strongly motivated 
toward achieving academic excellence at the 
School. 


⁷ Obtained by z transformation.

⁸ Experimental administration of the NKT to OCS
applicants is taking place at ONOP branches. Further
analysis is being conducted by the Navy to see
whether validity holds up under operational conditions
of administration and use.


The original form of the test contained 136 
items dealing with: identification or defini- 
tions of names, words, phrases, slang, and 
symbols; information about ships that have 
been prominent in naval history; events and 
locations that have been prominent in naval 
history; prominent naval personalities, past 
and present; and knowledge of naval organi- 
zation and practices. 

The test was administered to samples of 
new officer candidates, before they began 
academic work at OCS, who had no previous 
active duty in the Navy. When the NKT 
was included in a battery with the Officer 
Qualification Test, currently the principal 
screening instrument, prediction of academic 
grades obtained during the first month of 
school was appreciably improved over that 
obtained with the OQT alone. 

By item analysis, 69 of the original items 
were chosen which showed best ability to 
predict academic success while holding over- 
lap with the OQT to a minimum. These 
were drawn in about equal proportions from
the original five types of items. 

In order to check on the generalization of 
validity, the NKT was administered to an- 
other sample of “naive” officer candidates in 
a new incoming class and scored with the 
69-item key. The shorter form of the NKT 
predicted academic achievement as well as 
the original form had done and added the 
same increment of predictive efficiency when 
used in combination with the OQT. 

There are in progress further studies of the
validity and practicality of the NKT when 
used in an operational setting, directly in- 
volving OCS applicants. 


Received February 16, 1956. 


References 


1. Bureau of Naval Personnel. The Navy Officer 
Qualification Test, Forms 4, 5 and 6: I. De- 
velopment and standardization. Washington, 
D. C.: Bureau of Naval Personnel, Research 
Division, 1952 (NavPers 18318). 

2. Glickman, A. S. The Naval Knowledge Test: 
construction and validation. U. S. Bur. Nav. 
Personn., Tech. Bull., 1954, No. 54-7. 

3. Glickman, A. S., & Vallance, T. R. Development 
and validation of an experimental battery to 
select officer candidates for the Navy. U. S. 
Bur. Nav. Personn., Tech. Bull., 1954, No. 
54-12. 











Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Development of a Structured Disguised Personality Test¹


Bernard M. Bass 


Louisiana State University 


The ease with which applicants for busi- 
ness and industrial positions can fake tradi- 
tional personality inventories has stimulated 
the attempt to develop distortion-free tests. 
Since 1945, interest has focused on forced- 
choice procedures as a solution to the prob- 
lem, yet, in an unpublished study, we have 
found that sales applicants can readily fake 
certain types of forced-choice inventories such 
as the Gordon Personal Profile. Travers (8) 
has discussed the possible ubiquitous fakabil- 
ity of forced choice. 

Structured disguised personality tests rep- 
resent another approach to solving the prob- 
lem. Such tests combine the measuring 
properties of projective techniques with the 
objective scoring of inventories. Various 
types have been proposed such as error- 
choice, sentence completion, word or para- 
graph interpretation, mutilated figures, and 
word association. Campbell (3) has sur- 
veyed the application of these techniques to 
assessing attitudes. 

The present study aimed to develop and 
evaluate a multiscale proverbs² test to assess
selected personality variables deemed significant
for occupational success. In 1938, Murray’s
Explorations in Personality (6) briefly
presented lists of proverbs whose acceptance 
or rejection had been used to assess various 
personality needs. Little evidence was in- 
cluded concerning validity or reliability of 
this process. Recently, Baumgarten (1) re- 
ported using proverb selection as a means of 
assessing worker attitudes. Psychiatrists have 
long used proverb interpretation to assess 
intellectual functioning. 


1 This study was aided by a grant from the Louisi- 
ana State University Graduate Council on Research. 
The author was assisted in the analyses by Ki Suk 
Kim, Charles H. Coates, and George Palmer. He 
wishes to thank Donald T. Campbell, Cecil Gibb, 
Gerald McCullough, and Arnold Gebel for their help 
in data collection. 

2 The term “proverb” will be used loosely to in- 
clude a variety of statements such as maxims, adages, 
apothegms, aphorisms, and sayings. 


The Lists of Proverbs 

With Murray’s classification of needs as a 
guide, 13 a priori lists of 20 proverbs each 
were constructed from selected sources in- 
cluding the lists in Explorations in Person- 
ality, Bartlett’s Familiar Quotations (2), Rich- 
mond’s Modern Quotations (7), and private 
lists of Louisiana Negro proverbs. Forty ad- 
ditional proverbs were in the first inventory 
but were not scored. The lists were inter- 
mixed in the inventory of 300 proverbs pre- 
sented to the examinees. The directions were 
substantially as follows: 

“This is a test of your attitudes toward 
various famous sayings. Read each one care- 
fully to find its true meaning to you. Indi- 
cate whether you agree, disagree, or are un- 
certain about the statement. If you cannot 
make up your mind, it will help if you ask 
yourself if you believe the statement is usu- 
ally true or usually false.” 

On each scale, two points were assigned re- 
sponses of “yes,” one point was assigned “?” 
responses, and no points were assigned for 
“no” responses. Simple totals were based on 
sums of points assigned for a given scale. 
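The scoring rule just described amounts to a weighted count of responses. A minimal sketch follows; the response strings and the function name are illustrative, and the sample responses are invented.

```python
POINTS = {"yes": 2, "?": 1, "no": 0}


def scale_total(responses):
    """Sum the points for one scale's responses ("yes", "?", or "no")."""
    return sum(POINTS[response] for response in responses)


# Hypothetical responses to five proverbs of one scale (each scale had 20).
sample_responses = ["yes", "no", "?", "yes", "?"]
print(scale_total(sample_responses))
```

With 20 proverbs per scale, totals under this rule range from 0 to 40.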

One example proverb from each of the 13 
scales is as follows: 


1. Material Comfort. (All the money in the world 
is useless if you can’t spend it just as you like.) 

2. Sex. (It is not very difficult to fall in love.) 

3. Harm Avoidance. (The doors of death are ever 
open.) 

4. Achievement. (To obtain success by your own 
efforts is the greatest joy in life.) 

5. Affiliation. (There is no satisfaction without a 
companion to share it.) 

6. Deference. (In matters of conduct it is best to 
conform to custom.) 

7. Autonomy. (It is best to stand alone when in 
trouble.) 

8. Aggression. (To forgive an enemy is a sign of
weakness.)

9. Abasement. (Outside show never makes up for
inner worth.)

10. Rejection. (Never trust a flatterer.)

11. Nurturance. (Giving is always better than receiving.)

12. Superego Strength. (No degree of temptation
justifies any degree of sin.)

13. Irritability. (Only a statue’s feelings are not
easily hurt.)


Table 1

Tetrachoric Intercorrelations Among 13 Famous Sayings’ Scales and Personal Characteristics of Subjects
(N = 400)

1. Material Comfort
2. Sex
3. Harm Avoidance
4. Achievement
5. Affiliation
6. Deference
7. Autonomy
8. Aggression
9. Abasement
10. Rejection
11. Nurturance
12. Superego Strength
13. Irritability
Age
Male vs. Female
Education
North vs. South

9 10 11 Age M-F Educ N-S
46 -15 -02 -21 -05
24 -27 09 -36
48 57 -03 -12 -09
49 45 -10 06 -28
60 23 -07 -13 -05
61 42 -15 -01 -13
25 48 -10 -21 -25
25 51 -07 -06 -07
67 51 -19 -18 -26
18 43 -10 -02 -23
44 -21 -21 -08
42 -08 -20
-21 -14
15

Note. Decimal points omitted.


Factor Analysis 
Subjects 


The 300-item form was administered to approximately
2,000 cases in a variety of samples
drawn from different segments of the national
population. These included Southern
department store saleswomen and penitentiary
inmates, high school seniors in New
Hampshire, student nurses in the same state, 
adult residents of Los Angeles, supervisors of 
a Louisiana petroleum refinery, Midwestern 
and Southern college students, Louisiana pub- 
lic school teachers, rural Southern high school 
sophomores and seniors, Marine Corps en- 
listed men, Chicago cosmetic salesmen, etc. 

From these 2,000 cases, a sample of 400
was constructed which represented as well as
possible, under the circumstances, the American
population which would be most likely
to be assessed routinely by the final test,
should such a test prove useful.³ The 400
subjects had a mean age of 26.5, with a standard
deviation of 10 years. The mean education
was 14.5, with a standard deviation of
1.7 years. Forty-six per cent of the subjects
were from the North, Midwest, and West
Coast while 54% were from the South. Sixty
per cent were male and 40% were female.


Results


Table 1 shows the tetrachoric intercorrelations
among the 13 scales and the demographic
data (age, sex, education, and geographic
region). A multiple centroid factor
analysis was performed on this matrix of intercorrelations.
Table 2 shows the final rotated
factor matrix following seven rotations
of the five factor axes.⁴

³ It was assumed that a personality inventory
would most commonly be used for screening applicants
for professional, managerial, and technical
occupations. We attempted a crude miniature reproduction
of such a sample. Entry into such occupations
usually occurs after some college training and
during the ages 20 to 35. More are male than female.
Approximately 60% live in the North. Lack
of opportunity forced us to include in the sample
more Southerners and more females than originally
desired.

⁴ The unrotated factor matrix and the final transformation
matrix have been deposited with the
American Documentation Institute. Order Document
No. 5042 from the ADI Auxiliary Publications Project,
Photoduplication Service, Library of Congress,
Washington 25, D. C., remitting in advance $1.25




Structured Disguised Personality Test 


Table 2 


Final (V;) Rotated Factor Matrix 


Factor

Variable    I (Conventional Mores)    II (Hostility)    III (Age-Education)    IV (Fear of Failure)    V (Sampling Imbalance)

1. Material Comfort
2. Sex
3. Harm Avoidance
4. Achievement
5. Affiliation
6. Deference
7. Autonomy
8. Aggression
9. Abasement
10. Rejection
11. Nurturance
12. Superego Strength
13. Irritability
Age
Sex
Education
North vs. South


39 
16 
44 
31 
70 
.66 
.06 
16 
.68 
10 
85 
at 
A5 


49 
57 
28 
18 
15 
Al 
65 
76 
.22 
64 


.06 
49 
.00 
06 
— .06 
.09 


Factors 


It was fairly easy to find meaning in the 
final solution and to label the factors accord- 
ingly. The factors and variables most highly 
loaded on each were as follows: 


Factor I: Conventional Mores          Loadings
11. Need for Nurturance                 .85
12. Superego Strength                   .47
5. Need for Affiliation                 .70
9. Need for Abasement                   .68
6. Need for Deference                   .66

Factor II: Hostility
8. Need for Aggression                  .76
7. Need for Autonomy                    .65
10. Need for Rejection                  .64

Factor III: Age-Education
Education                               .81
Age                                     .71

Factor IV: Fear of Failure
4. Need for Achievement                 .84
3. Need for Harm Avoidance              .62


The last factor, V, concerned sex and ge- 
ography and was a consequence of accidental 
for microfilm or $1.25 for photocopies. Make checks 


payable to Chief, Photoduplication Service, Library 
of Congress. 


—.10 
—.39 

O1 
— .04 
.00 
09 
12 
00 
.23 
12 
—.14 

.03 
—.19 

71 
— .04 

81 
—.31 


42 —.21 
30 —.29 
.62 —.09 
84 03 
24 — 05 
—.14 
37 
12 
19 
—.01 
fe | 
Al 
04 
— .06 
— 44 
—.04 
Al 


30 
18 
18 
34 
14 
.O8 
.26 
— .04 
—.21 
—.25 
.22 




sampling in which Southern males and North-
ern females were slightly overrepresented.5
However, the three test factors, I, II, and IV, 
were independent of both demographic fac- 
tors, III and V. 

The three test factors markedly resemble 
clusters of items independently isolated by 
Cook and Medley (4) in a study of the Min- 
nesota Multiphasic Personality Inventory—a 
traditional, undisguised test. They found that 
teachers, dichotomized according to their self- 
reported ability to get along with pupils, also 
varied on the MMPI in their hostility toward 
others, in their adherence excessively to rigid 
standards of morality and in their pride in a 
thorough knowledge of subject matter. The 
“hostility” and “pharisaic-virtue” scales de- 
veloped by Cook and Medley appear to in- 
volve similar content to our Factor II and 
Factor I, while “pride in knowledge” has some 


5 Initially, there was a tetrachoric correlation of
— .22 between “maleness” and “northernness” due 
to the fact that more of the males in our study were 
Southerners and more of the females were Northern- 
ers. The factor was labeled “Sampling Imbalance.” 
The original test scales were uncorrelated with this 
factor to any large degree. Inspection of Table 1 
suggests that sex and geographic region were inde- 
pendent of all other variables except each other. 







resemblance to the fear of failure factor we 
isolated. 

The high loading of “harm avoidance” on
Factor IV conforms to McClelland’s (5) find-
ing that some subjects are motivated to
achieve by hope of success, others by fear of
failure.


Item Analysis 


The next step was to develop by item 
analysis a scale to measure each factor. 

A new sample of 200 subjects was drawn 
from the original pool of 2,000 cases. The 
mean age of the new sample was 21 years 
with an SD of 3.5 years. It had a mean edu- 
cation of 14.0 years with an SD of 1.5 years. 
Half the sample was from the South, the 
other half from other regions of the country 
while 55% were male and 45% female. 

Pooled scores from original scales—11. 
Nurturance and 12. Superego Strength— 
served as a crude measure of conventional 
mores. Scale 8 was used as a first approxi- 
mation of hostility and Scale 4 was used as a 
first approximation of a fear of failure scale.6

For each of the three crude scores thus de- 
rived, the sample of 200 was trichotomized 
into an upper scoring 25%, a middle scoring 
50%, and a lower scoring 25%. For the item 
analyses, the tendency to respond “Yes” of 
the upper and lower 25% of the distributions 
was compared since about half of all re- 
sponses fell into this category while about 
half were “?” or “No.” 

An inspection of percentage differences per-
mitted the discrimination and selection of
items for each scale which correlated posi-
tively with that scale and were relatively in-
dependent of performance on the other two
scales.

For example, consider the selected item 
“Meekness is better than vengeance,” which 
was “accepted” by the criterion groups as 
follows: 


6 For a discussion of crude versus accurate meth- 
ods of estimating factor scores, the reader is re- 
ferred to R. B. Cattell, Factor analysis. New York: 
Harper, 1952, p. 80. In the present situation, the 
selected scales correlated so highly with the factors 
they were to represent, it was believed that little 
would be gained in employing multiple-regression 
procedures to optimally weight the scales to yield 
a maximum correlation with each factor. 


Bernard M. Bass 


                                      Criterion

                          Conventional                 Fear of
                          Mores          Hostility     Failure

Criterion   Upper 25%     76%            44%           60%
Group       Lower 25%     36             56            44
% Difference              +40            -12           +16


This item was included in the final scale of
conventional mores because it discriminated
the “highs” from the “lows” on that scale but
not on the others. Most items selected were
accepted by at least 40% more of the upper
than of the lower criterion group on the scale
for which the item was selected, and by less
than 20% more of the upper than of the lower
groups on the other scales.7
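The selection rule just described can be sketched in a few lines. The sketch assumes the acceptance percentages of the example item (the upper-group figures for Hostility and Fear of Failure are those implied by the tabled differences), treats the 40% and 20% figures as fixed cutoffs although the text says only that most items met them, and uses hypothetical function names that do not come from the original study.

```python
# Percentage answering "Yes" in the upper and lower 25% criterion groups
# for the item "Meekness is better than vengeance."
ITEM = {
    "Conventional Mores": {"upper": 76, "lower": 36},
    "Hostility": {"upper": 44, "lower": 56},
    "Fear of Failure": {"upper": 60, "lower": 44},
}


def differences(acceptance):
    """Upper-minus-lower percentage difference on each criterion scale."""
    return {scale: g["upper"] - g["lower"] for scale, g in acceptance.items()}


def scale_for_item(acceptance, own_min=40, other_max=20):
    """Return the scale the item would be assigned to, or None.

    An item qualifies for a scale when its difference on that scale is at
    least own_min and its difference on every other scale is below
    other_max (assumed cutoffs, see the lead-in).
    """
    diffs = differences(acceptance)
    for scale, diff in diffs.items():
        others = [v for s, v in diffs.items() if s != scale]
        if diff >= own_min and all(v < other_max for v in others):
            return scale
    return None


print(differences(ITEM))
print(scale_for_item(ITEM))
```

Applied to the example item, the rule assigns it to Conventional Mores, in agreement with the decision reported in the text.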

Thirty items were selected from the 300 for 
the C or Conventional Mores scale, 30 for the 
H or Hostility scale, and 20 for the F or Fear 
of Failure scale. 


Reliability and Interrelations 


A new sample of 100 subjects previously 
unused was drawn from the original pool of 
2,000. This sample averaged 28.2 years in 
age with an SD of 12.8 years. Similar to 
preceding samples, its mean amount of edu- 
cation was 14.0 years with an SD of 2.2 years. 
Forty-six per cent were Southerners, and 49% 
were male; 54% were from the North or 
West, and 51% were female. 

For the sample, the intercorrelations found
among the scales were: $r_{CH} = .45$, $r_{CF} = .54$,
$r_{FH} = .48$. Corrected split-half reliabilities
were obtained as follows: Conventional Mores,
.83; Hostility, .72; and Fear of Failure, .69.

An analysis of the reliability of the scales 
and the intercorrelations among them was 
performed for an available more homogeneous 
sample of 147 Louisiana Penitentiary in- 
mates. The inmates averaged 29.3 years in 
age with an SD of 7.9 years. Mean educa- 
tion was 10.5 years with an SD of 1.3 years. 
The sample was all male and almost totally 
Southern in origin. 


7 A maximum value of .1 is possible for the stand-
ard error of the difference between proportions in
two samples of 50 cases each. Therefore, a differ- 
ence of .4 would be at least four times the standard 
error of the difference, likely to occur on a chance 
basis much less than 1% of the time. 
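The arithmetic behind this footnote can be checked with a brief sketch. It assumes the usual unpooled formula for the standard error of a difference between two independent proportions, which the footnote does not state explicitly; the function name is hypothetical.

```python
import math


def se_difference(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions
    (unpooled form, assumed here)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# The standard error is largest when both proportions equal .5.
max_se = se_difference(0.5, 50, 0.5, 50)

# A difference of .40 expressed in units of that maximum standard error.
ratio = 0.40 / max_se

print(round(max_se, 3), round(ratio, 1))
```

Because any other pair of proportions gives a smaller standard error, a difference of .40 is at least four standard errors in every case, as the footnote asserts.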







For this penitentiary sample the intercor-
relations found were: $r_{CH} = -.21$, $r_{CF} = .10$,
$r_{FH} = -.30$. Corrected split-half reliabilities
were: Conventional Mores, .72; Hostility,
.58; and Fear of Failure, .45.8

A revised inventory of 90 items was pre-
pared which contained 30 items for each of
the three scales. Ten of the Fear of Failure
items were newly written.9 For a sample of
Louisiana State college sophomores, the in-
tercorrelations among these scales were as
follows: $r_{CH} = -.12$, $r_{CF} = .32$, $r_{FH} = .42$.
The corrected split-half reliabilities were:
Conventional Mores, .73; Hostility, .69; and
Fear of Failure, .75.


8 Assuming that inmates of a state penitentiary
are more homogeneous than a naturally scattered
sample of men and women, these results illustrate
that the reliability and independence of factors are
a function of the homogeneity of samples used.
The more homogeneous the sample, the lower is
factorial reliability and the higher is factorial inde-
pendence. Reliability and orthogonality are relative
to sample homogeneity.

9 The ten items were added in an attempt to in-
crease the reliability of the Fear of Failure scale.


Summary


To develop a disguised but objective personality
inventory, a factor analysis was performed on
scores based on 400 examinees’ tendencies to
accept or reject 13 lists of proverbs constructed
to cover 13 areas. The three test factors which
emerged following seven rotations were:
Conventional Mores, Hostility, and Fear of
Failure. Using 200 new examinees, scales were
constructed by item analysis to measure each.
In subsequent samples, the three scales were
found to have corrected split-half reliabilities
ranging from .45 to .83 and intercorrelations
ranging from -.12 to .54. The reliabilities and
intercorrelations among the scales were higher
when the groups were more heterogeneous in
background.

The reliabilities and intercorrelations among 
the scales suggest that three separate behav- 
ioral tendencies are being assessed. Subse- 
quent reports will deal with the relations 
between the scales and intelligence, other per-
sonality test scores, peer ratings, and occu-
pational success as salesman or as industrial
supervisor.

A revised 90-item form has been prepared. 


Received December 19, 1955. 


References 


1. Baumgarten, F. A proverb test for attitude
measurement. Personnel Psychol., 1952, 5,
249-261.

2. Bartlett, J. Familiar quotations. Boston: Little,
Brown, 1948.

3. Campbell, D. The indirect assessment of social
attitudes. Psychol. Bull., 1950, 47, 15-38.

4. Cook, W. W., & Medley, D. M. Proposed hos-
tility and pharisaic-virtue scales for the
MMPI. J. appl. Psychol., 1954, 38, 414-418.

5. McClelland, D. (Ed.) Studies in motivation.
New York: Appleton-Century, 1955.

6. Murray, H. A. Explorations in personality. New
York: Oxford Univer. Press, 1938.

7. Richmond, A. Modern quotations. New York:
Dover, 1947.

8. Travers, R. M. W. A critical review of the va-
lidity and rationale of the forced-choice tech-
nique. Psychol. Bull., 1951, 48, 62-70.





Journal of Applied Psychology
Vol. 40, No. 6, 1956 


A Scale Measuring Attitudes Toward Working for the Government 1


Barbara P. Aalto 


Counseling Center, University of California, Berkeley 


There is a widespread feeling on the part 
of some segments of the population that 
working for the government is looked upon 
with disfavor. Public personnel officials fre- 
quently point to these attitudes as contribut- 
ing factors in high turnover, inability to at- 
tract top-level talent to government jobs, and 
low morale among government workers them- 
selves. Since these attitudes may be factors 
in an individual’s choice of career or place of 
employment, they may be of interest to high 
school and college counselors. 

A review of the literature reveals a lack of 
reliable and valid instruments for measuring 
government employment attitudes. The study 
being described in this paper was designed 
for the purpose of constructing such a scale. 
A measure of attitudes toward government 
employment could be used in the following 
ways: 

1. To identify high school and college stu- 
dents who might find satisfaction in a gov- 
ernment career. 

2. To aid in the counseling of students in 
their choice of a career. 

3. To identify and study morale problems 
within the government service. 

4. To evaluate the general level of atti- 
tudes toward government service in the gen- 
eral population, in specific groups, and in spe- 
cific geographical areas. 

5. To study changes in these attitudes with 
education, work experience, age, and changes 
in political administration. 

Two restrictions were placed on the prob- 


1 This study was completed when the author was 
on the staff of the Student Counseling Bureau, Uni- 
versity of Minnesota. It is a condensation of a dis- 
sertation submitted to the faculty of the University 
of Minnesota in partial fulfillment of the require- 
ments for the degree of Doctor of Philosophy. The 
author wishes to express thanks to her advisor, Pro- 
fessor D. G. Paterson, for his interest and invaluable 
help in the project. The study could not have been 
completed without the generous help of a group of 
personnel directors in both government agencies and 
private business and industry who helped in the col- 
lection of the data. 


lem to keep it within manageable proportions 
and avoid confusion of results. The attitudes 
expressed were toward jobs in the federal gov- 
ernment and in the career service, not elected 
officials or members of the Armed Forces. 
The study dealt with attitudes toward work- 
ing for the government mainly at the profes- 
sional and managerial levels. White (5, 6), 
in studies of the prestige of government em- 
ployment in 1929 and 1932, found differ- 
ences in attitudes toward federal, state, and 
local government employment, and differences 
in attitudes toward jobs at various occupa- 
tional levels. 


Construction of the Scale 


Construction of the attitude scale followed 
a modified Likert-type procedure. Opinion 
statements were compiled from a variety of 
sources including previous attitude surveys, 
occupational information material, and the 
stated opinions of University of Minnesota 
graduate students. Opinions expressed in 
public administration journals and a govern- 
ment employees’ newspaper, The Government 
Standard, were also adapted for use. These 
statements were edited by the writer, three 
faculty members with experience in attitude 
scale construction, and one faculty member 
from the political science department. Half 
the statements were worded in a direction 
favorable to government employment and 
half in a negative direction. The readabil- 
ity of the items ascertained by applying the 
Flesch (1) ‘Reading Ease” formula was 
similar to that found in a digest type of 
magazine, well within the comprehension of 
high school seniors. 

A preliminary form (Form A) of 109 items 
was administered during class time to 173 
students in introductory laboratory psychol- 
ogy at the University of Minnesota. Re-
sponses to items were made on a 5-point scale
from “strongly agree” to “strongly disagree.” 
Each item was scored by assigning a weight
of 5 to “strongly agree,” 4 to “agree,” 3
to “undecided,” 2 to “disagree,” and 1 to
“strongly disagree” if the item was worded
in a positive direction (favorable to govern- 
ment employment). The weights were re- 
versed for negatively stated items. The total 
score was the sum of the weights for each 
item. Results of this administration showed 
that the items elicited a wide diversity of 
attitudes toward government employment and 
were not ambiguous to the respondents. 
Sixteen items of low discriminating capacity 
were eliminated, using the Rundquist-Sletto 
(3) item-scale difference method of item 
analysis. A group of seven items were added 
to provide more positively stated items that 
would be discriminating since in the item 
analysis a larger proportion of negatively 
stated items were high in discrimination value. 
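The scoring scheme described above can be illustrated with a minimal sketch. The weights are those given in the text; the function names and the way the item key is represented are hypothetical conveniences and are not part of the original procedure.

```python
# Weights for an item worded favorably to government employment.
WEIGHTS = {
    "strongly agree": 5,
    "agree": 4,
    "undecided": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def item_score(response, positive):
    """Weight for one response; weights are reversed for negative items."""
    weight = WEIGHTS[response]
    return weight if positive else 6 - weight


def total_score(responses, keyed_positive):
    """Sum of item weights over all items of a form."""
    return sum(item_score(r, p) for r, p in zip(responses, keyed_positive))


# A respondent who marks "undecided" on every item of a 70-item form that
# is half positively and half negatively worded.
key = [True] * 35 + [False] * 35
print(total_score(["undecided"] * 70, key))
```

With every item marked "undecided," a 70-item form yields 3 x 70 = 210, the neutral point of the Final Scale.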

It was felt that a better scale would result 
if an item analysis were done on the basis of 
an outside criterion as well as on the basis of 
internal consistency. This procedure is in 
keeping with commonly accepted methods of 
test construction, but has less generally been 
used with the traditional methods of attitude 
scale construction of the Likert type. Gov- 
ernment workers who are satisfied with their 
jobs and assign a low rank to private em- 
ployment have been selected as representa- 
tive of individuals with attitudes most favor- 
able to working for the government. In 
contrast, it is assumed that a group of em- 
ployees in private business who are satisfied 
with their work and give a low rank to gov-
ernment employment would represent atti-
tudes least favorable to working for the gov-
ernment.

The 100-item questionnaire, Preliminary 
Form B, was subsequently administered to 493 
federal government employees and 299 pri- 
vate employees. They were mainly employed 
in occupations defined in the Dictionary 
of Occupational Titles (4) as professional 
and managerial. Included with the question- 
naire were the Hoppock Job Satisfaction 
Blank (2) and a personal data sheet. The 
questionnaires were distributed by agency 
and firm personnel officers, completed anony- 
mously by employees, and returned to the 
writer in sealed envelopes. Each agency or 
firm personnel officer endeavored to get as
large a sample of employees as possible.
Collection of data extended over a period 
from June, 1953, to July, 1954. This was a 
period in which the same administration (Re- 
publican) was in office, but not long after a 
change of administration. The government 
group was from Minnesota and Washington, 
D. C.; the private group was from large firms 
mainly in Minnesota, Ohio, Connecticut, and 
New Jersey. The two groups were compa- 
rable in age, sex, work experience, educational 
level, and occupational distribution. 

An item analysis was performed using both 
an internal and an external criterion. The 
internal criterion used was the top and bot- 
tom 27% in total score irrespective of place 
of employment (based on a total sample of 
733 from both government and private em- 
ployment at the time the analysis was done). 
The external criterion was a composite of job 
status and job satisfaction. The government 
criterion group (N = 249) consisted of pres-
ently employed government workers who met
a stated definition of being “satisfied with 
government employment.” The satisfaction 
definition was twofold: A score above the 
53rd percentile on the Hoppock Job Satisfac- 
tion Blank (2), and a rating of first or sec- 
ond choice given to present employment on a 
ranking question on place of employment (be 
own employer, educational institution, fed- 
eral government, local government, private 
business, state government). The private 
criterion group (N = 163) was composed of
employees of private business and industry, 
likewise satisfied with their jobs. The gov- 
ernment and private criterion groups were 
similar in age, educational, and occupational 
distribution. Each criterion group was split 
to provide a criterion group and a cross- 
validation group. Items were eliminated that 
did not differentiate on the basis of both the 
external and internal analysis and in both 
validation and cross-validation groups at the 
.01 level, using chi-square analysis. In ad-
dition, items were eliminated which did not
show a scale value difference over .5000 using 
the Rundquist-Sletto (3) technique or which 
were clearly a duplication of statements in 
other items. Items were selected so as to 
provide an equal number of positively and 
negatively stated items. 
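A chi-square test of a single item against one criterion might be sketched as follows. The sketch assumes that responses were reduced to a 2 x 2 table (group by agree versus not agree) and that no continuity correction was applied; the text specifies neither point. The counts are hypothetical, and 6.635 is the tabled chi-square value for the .01 level with one degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, without continuity correction, for the table
    [[a, b], [c, d]]: rows are groups, columns are agree / not agree."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


CRITICAL_01_DF1 = 6.635  # chi-square required at the .01 level, 1 df


def differentiates(a, b, c, d):
    """True when the item separates the two groups at the .01 level."""
    return chi_square_2x2(a, b, c, d) > CRITICAL_01_DF1


# Hypothetical counts: 40 of 50 in one group agree, 20 of 50 in the other.
example = chi_square_2x2(40, 10, 20, 30)
print(round(example, 2), differentiates(40, 10, 20, 30))
```

Under the procedure described, an item would be retained only if it passed such a test on the internal and the external criterion, and in both the validation and the cross-validation groups.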







Table 1

Distribution of Scores on Final Scale, Government Employment Attitudes
Scale, for 493 Government Workers and 299 Private Employees

Score        Government    Private    Total

325-339           3
310-324           5
295-309
280-294          75
265-279
250-264
235-249
220-234
205-219
190-204
175-189
160-174
145-159
130-144

N               493
Mean          261.9        213.3      243.6
SD             25.6         29.8       36.0





Statements such as “scientific and profes- 
sional groups look down on people who work 
for the government,” “there is no need for 
government to be inefficient,” and “the fact
of job security would make me want to work 
for the government” failed to yield signifi- 
cant results on all analyses. An item like 
“government workers are as honest as those 
privately employed” brought responses of 
“agree” or “strongly agree” from both groups, 
while both groups disagreed with the idea 
that “the government service is full of Com-
munists.”


The Final Scale 


The Final Scale,2 Government Employment
Attitudes Scale, contained 70 items. Exam- 
ples of items were as follows: In a govern- 
ment job, it is hard to make use of one’s own 
ideas; government workers keep trying to do 
a better job; a government job would be all 
right if you couldn’t get another job; good 
college students should be urged to enter 
government service. The responses of 493 
government employees and 299 private em- 
ployees were rescored on the basis of this 

2 A copy of the items that comprise the Govern-
ment Employment Attitudes Scale may be obtained


from the author, Counseling Center, University of 
California, Berkeley 4, California. 


Final Scale. Table 1 shows the distribution 
of scores for both groups. The average score 
for government workers was 261.9 with a 
standard deviation of 25.6 and for private 
employees, 213.3 with a standard deviation 
of 29.8. The neutral point would be 210. 
The higher the score the more favorable the 
attitudes toward government employment. 
The difference in the attitudes of govern- 
ment and private business employees is of 
considerable magnitude since only about 5% 
of the privately employed were more favor- 
able in attitudes toward government employ- 
ment than the average federal government 
employee. The attitudes of the average pri- 
vate employee would be described as neutral 
rather than negative to government employ- 
ment. It is of interest to note, however, that 
on a more general measure of job satisfaction, 
the Hoppock Job Satisfaction Blank, there 
was no difference in average scores between 
the two groups. The average score for gov- 
ernment employees was 21.4 with a standard 
deviation of 2.6, while the average score for 
private employees was 21.5 with a standard 
deviation of 2.6. A score of 21 is assigned 
a percentile rank of 53 on the norms given 
by Hoppock for 309 adults, 88% of all em- 
ployed adults in New Hope, Pennsylvania, 
1933. Federal government employees do not 
appear to be the dissatisfied, low-morale group 
they are frequently reported to be. At the 
professional-managerial levels, they are as 
satisfied as their private-business counter- 
parts. 
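The size of the overlap between the two groups can be roughly checked from the reported means and standard deviations. The sketch below assumes that scores of private employees are approximately normally distributed, an assumption not made in the text; the reported figure of about 5% was presumably read from the actual distribution in Table 1.

```python
from statistics import NormalDist

# Private employees: mean 213.3, SD 29.8 (from the text).
private = NormalDist(mu=213.3, sigma=29.8)

# Proportion of private employees expected to exceed the government
# mean of 261.9 if their scores were normally distributed.
share_above = 1 - private.cdf(261.9)

print(round(share_above, 3))
```

The normal approximation gives a proportion close to the 5% reported.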

As shown in Table 2, a reliability coeffi- 
cient of .94 (corrected by the Spearman- 
Brown formula) was found for the govern- 
ment sample and .96 (corrected) for the pri- 
vate group. These results indicate that the 


Table 2

Reliability of the Final Scale

                              Reliability
                              Coefficient    Corrected
Group                  N      (odd-even)     Coefficient*

Government sample     493        .89            .94
Private sample        299        .92            .96

* Corrected by the Spearman-Brown prophecy formula.
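The correction in Table 2 can be reproduced with the Spearman-Brown prophecy formula, $r_{corrected} = 2r / (1 + r)$ for a test of double length. The following is a minimal sketch; the function name and the general lengthening factor are illustrative additions.

```python
def spearman_brown(r_half, factor=2):
    """Predicted reliability of a test lengthened by `factor`, given the
    correlation r_half between comparable parts."""
    return factor * r_half / (1 + (factor - 1) * r_half)


government = spearman_brown(0.89)
private = spearman_brown(0.92)

print(round(government, 2), round(private, 2))
```

Rounded to two places, the corrected values agree with those tabled for the government and private samples.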







Table 3

Summary of Validity Data

                                                                                                      Diff.
                                                                                                      Between   Weighted   Observed
Groups Studied                                          N1     N2     F           M1       M2        Means     t          t

Validation groups: satisfied gov't (1)
  vs. private (2)†                                      125           1.9794**    270.1    211.0     59.1      2.6387     14.5494**
Cross-valid. groups: satisfied gov't (1)
  vs. private (2)†                                      118           2.2818**    275.6    209.9     65.7      2.6394     18.7302**
Gov't work 1st choice (1) vs.
  private employer 1st choice (2)†                                    2.4443**    273.0    218.4     54.7      2.6063     15.9866**
Satisfied gov't (1) workers
  vs. dissatis. gov't (2)                                      186    1.1139      272.8    248.8                          10.4202**
Satisfied private (1) vs.
  dissatisfied private (2)                              191    95     1.0790      210.3    220.7     10.4                  2.8726**
Dissatisfied gov't (1) vs.
  dissatisfied private (2)                              186    95     1.4482      248.8    220.7     28.2                  8.4653**

† Significance test used was Cochran-Cox method to test the hypothesis of equality of means with no hypothesis about the
population variance.
** Significant at the .01 level.


scale measures with a high degree of consist- 
ency within each group. 

Several types of evidence indicate that the 
scale has validity and does in fact measure 
attitudes toward government employment. 


These results are summarized in Table 3. 
All differences were significant at the .01 
level. With the exception of the first two, 
these analyses are based on the responses of 
subjects who were partially or wholly not a 
part of the original item analysis and thus 
provide evidence for validity mainly inde- 
pendent of the scale construction groups. 

1. The criterion group of satisfied govern- 
ment workers differed significantly from the 
criterion group of satisfied private employees. 
Furthermore, the differences were of sufficient 
magnitude to have practical guidance and 
selection values, as shown by the fact that 
less than 1% of satisfied private employees 
reached or exceeded the mean of satisfied
government workers. 

2. The cross-validation groups likewise 
showed statistically significant differences. 
None of the satisfied private employees in 
the cross-validation group reached or ex- 
ceeded the mean of satisfied government 
workers. 

3. Satisfied government workers scored sig- 
nificantly higher than dissatisfied government 


workers. Again, the difference was of con- 
siderable magnitude. 

4. Satisfied private employees scored sig- 
nificantly lower than dissatisfied private 
workers, but the difference was not as great 
as in the preceding comparison. 

5. Dissatisfied government workers scored 
much higher than dissatisfied private em- 
ployees. 

6. Workers who gave “federal government” 
employment first choice on a ranking ques- 
tion scored significantly higher than those 
who ranked “private employer” first choice. 
These differences were also large. 

7. As shown in Table 4, a correlation be- 
tween Government Employment Attitudes 
scores and the Hoppock Job Satisfaction 
Blank in the government group was + .45. 
This positive relationship, however, did not 


Table 4

Correlation Between Government Employment Attitudes and Job
Satisfaction for Government and Private Employees

                       Correlation
Group          N       Coefficient

Government    431        .45**
Private       287       -.08

** Significant at the .01 level.





hold for the private group. The correlation 
of — .08 was not significant. 

8. The scale items, in addition, appeared 
to have “face validity.” 


Limitation of the Scale 


It appears on the basis of this study that 
the Government Employment Attitudes Scale 
has sufficient reliability and validity to war- 
rant experimental use in counseling, selection, 
and research. Several suggestions for refine- 
ment and improvement could be mentioned 
for future study. These further studies would 
give the personnel worker greater confidence 
in his use of the instrument. The reliability 
and validity should be checked by adminis- 
tration of the Final Scale to a new group of 
subjects to insure that differences between 
satisfied government and private employees 
would hold up under further cross validation. 
Likewise, a check should be made of the re- 
liability of the scale upon repeated adminis- 
tration. 

If the measure is to be used with high school 
and college students, norms for these groups 
should be established. The relationship be- 
tween attitudes toward government employ- 
ment expressed while in school and later job 
satisfaction should be investigated in order 
to measure the predictive significance of the 
scale. 

One of the problems in the use of this scale 
for selection purposes is the possibility of 
faking responses. The items are by no means 
subtle and refer directly to attitudes toward 
government employment. This situation is 
one that does not warrant despair, but should 
be studied. The influence of faking could be 
investigated by administering the scale to an 
experimental group with the usual directions 
followed by directions to fake. A scoring 
technique might be devised that would meas- 
ure the degree of gross faking. 

The Government Employment Attitudes 
Scale was directly constructed to discrimi- 
nate between satisfied government and pri- 
vate employees in professional and manage- 
rial positions. However, the items might 
have high validity and reliability for use with 
workers in other occupational categories. Fur- 
ther study would be needed on the use of the 
scale at these other occupational levels. Like-
wise, the responses to the items on the scale
were in terms of federal government employ- 
ment. The hypothesis that the same items 
could be used to measure attitudes toward 
state and local government could be tested. 
The utility of the scale would be enhanced if 
it were possible to extend its use to other oc- 
cupational groups and to other levels of gov- 
ernment employment. 


Summary 


1. The purpose of the study was to con- 
struct a reliable and valid measure of atti- 
tudes toward government employment which 
could be used in counseling, selection, and 
research. 

2. A preliminary scale of 109 items was 
administered to 173 sophomore students in 
introductory laboratory psychology. Items of 
low discrimination value were eliminated. 

3. The standardization was extended to 
493 federal government employees and 299 
employees of private business and industry, 
mainly in professional and managerial occu- 
pations. An item analysis was done on the 
basis of both an internal criterion (top vs. 
bottom 27% in total score) and an external 
criterion (satisfied government workers vs. 
satisfied private employees). A validation
and cross-validation group were provided for
each analysis.

4. The Final Scale consisted of 70 items 
and appeared to have sufficient reliability and 
validity for further experimental use. 


Received March 20, 1956. 


References 


1. Flesch, R. A new readability yardstick. J. appl.
Psychol., 1948, 32, 221-233.

2. Hoppock, R. Job satisfaction. New York: Har-
per, 1935.

3. Rundquist, E. A., & Sletto, R. F. Personality
in the depression: a study in the measurement
of attitudes. Minneapolis: Univer. of Min-
nesota Press, 1936.

4. U. S. Employment Service. Occupational classifi-
cation. Vol. 2. Dictionary of occupational
titles. Washington: U. S. Government Print-
ing Office, 1949.

5. White, L. D. The prestige value of public em-
ployment. Chicago: Univer. of Chicago Press,
1929.

6. White, L. D. Further contributions to the pres-
tige value of public employment. Chicago:
Univer. of Chicago Press, 1932.





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Evaluation of a Supervisory Training Program with 
How Supervise? 


Richard P. Barthol 


University of California at Los Angeles 


and Martin Zeigler 


The Pennsylvania State University 


The problem of evaluating a training pro- 
gram, particularly training in human rela- 
tions, besets many personnel managers and 
industrial psychologists. How Supervise? has 
been offered as an instrument for this pur- 
pose. If it is so used, we must assume first 
that the test does measure information about 
desirable supervisory practices, and second 
that supervisory practices will be improved 
if the supervisors have learned those prin- 
ciples that represent approved supervisory 
practices. Karn (2) showed that a control 
group was not necessary since no significant 
changes occurred without training, in a short 
period of time. Several investigators (4, 5,
6, 8) have shown that scores on How Super-
vise? are related to education or intelligence,
and furthermore (3) that the reading level of
the test is at that of the high school graduate.
Wickert (7) found that Form B was more
sensitive than Form A as the posttest.

This study is of the bootstrap variety: if a 
training program is effective and if How Su- 
pervise? is a measure of effectiveness, then a 
posttest on Form B should yield significantly 
higher scores than a pretest on Form A.
Positive results would support the notion that 
the test has some validity and that the train- 
ing program was effective to some degree. 
Negative results would be inconclusive. 


Methods and Procedure 


The Westinghouse Electric Corporation conducted 
a program of supervisory training of twenty weekly 
meetings, each an hour and a half long. The sub- 
jects were 210 supervisory employees: foremen, gen- 
eral foremen, and department supervisors. They 
represented what are commonly called first- and 
second-line supervisors. Ages ranged from 25 to 60 
years. 

The conference method was used throughout.
Prior to presenting the program, eighteen leaders
were selected from the supervisory group and given
a one-week intensive leader training course in the
material to be covered. In presenting the course to
other supervisors, each leader followed a common
manual in a carefully prescribed manner. The course
content included production control, accident
prevention, cost problems, budgets, material control,
job instructor training, and human relations. No
attempt was made to cover the material contained
in How Supervise?, nor did the conference leader
have access to copies of the test, which was
administered by the training director. There is reason
to believe that the course was not devoted to the
obtaining of good grades on the measuring
instrument.

Form A of the test was administered at the be- 
ginning of the program and Form B was adminis- 
tered at the end. The scoring followed the pro- 
cedures outlined in the test manual (1). Additional 
data were collected so that the subjects could be
subdivided by educational level and previous super- 
visory training. 


Results 


Table 1 indicates that all groups achieved 
significantly higher scores on the posttest. 
The How Supervise? manual does not give 
adequate norms for Level II supervisors, 
which is probably the proper classification 
for this group, but it was estimated that the 
total group started the program at slightly 
below the norm mean and finished at well 
above the mean. The “college” group 
started above the norm mean and finished 
something over one standard deviation above 
the mean. As might have been expected 
from the earlier studies, the college group 
did significantly better (.001) than the ele- 
mentary or secondary groups. The difference 
between the elementary and secondary groups 
was virtually zero. The most striking educa- 
tional differences, as indicated by the stand- 
ard deviations in Table 1, were the changes 
in the variability. The college group showed 
a remarkable shortening of the distribution: 
the lowest score in the pretest was 24; the
lowest score in the posttest was 43, only four
points below the means of the other two
groups.


Table 1

Analysis of Scores on How Supervise? Before and After a Training Program

                                   Mean               SD
                         N      Pre     Post      Pre     Post     t(a)

Total Group             210    41.64    50.61    10.86     9.33    25.33
Education*
  Elementary             19    40.53    47.16    10.41    10.35     4.92
  Secondary              96    38.90    47.80    11.16     7.92    16.55
  College                60    47.43    57.40     8.37     4.95    14.01
Previous Training Program
  Yes                    73    42.73    51.11    10.47     6.53    13.47
  No                    137    41.07    50.42    11.06    10.12    21.14

* 35 subjects were dropped from this classification because their educational level was not reported.
(a) All values are significant beyond the .001 level. Values of t were computed by using difference scores on each subject: $t = \bar{D} / S_{\bar{X}_D}$.


This same kind of reduced variability occurred
in the group that had had previous
training, although not quite so dramatically.
The means of the two groups, with and without
previous training, were approximately the
same on both pre- and posttest, and each
group showed significant improvement. However,
the standard deviation in the previously
trained group dropped approximately four
points while the standard deviation of the
other group dropped only one point.

The manual (1) gives an example of improvement
resulting from a training program.
The mean of Level II supervisors was 49.3
before training and 52.9 after training, a
change of 3.6 points. This may be compared
with the present change from 41.64 to 50.61,
a difference of 8.97 points. The manual does
not indicate any way of interpreting such
differences.


Discussion


The results of this study seem to confirm
the earlier findings cited in this paper that
How Supervise? is more readily interpreted
by subjects who have graduated from high
school. However, there did not appear to be
any significant differences between subjects
who had gone only to elementary school and
those who had gone to high school. Although
all groups showed significant gains after
training, the college group was the most promising
in that almost all of this group were above
the mean of the norm group after training.
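
The note to Table 1 states that the t values were computed from each subject's difference score. A minimal sketch of such a computation follows, assuming the conventional paired formula in which the mean difference is divided by its standard error. The function name `paired_t` and the six pairs of scores are hypothetical illustrations; they are not data from this study.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Return t for paired scores: mean difference over its standard error."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    # Standard error of the mean difference: sample SD of differences / sqrt(N).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error


# Hypothetical Form A (pre) and Form B (post) scores for six supervisors.
pre_scores = [38, 45, 41, 50, 36, 44]
post_scores = [47, 52, 49, 55, 46, 51]

t_value = paired_t(pre_scores, post_scores)
print(f"t = {t_value:.2f} with {len(pre_scores) - 1} degrees of freedom")
```

Because each subject serves as his own control, only the variability of the differences enters the error term, which is why a pretest-posttest design of this kind can yield large t values from moderate mean gains.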

The large unanswered question is this: do 
these results indicate some kind of superior 
ability (or motivation) of those supervisors 
who had been to college, or does it mean that 
the test does not adequately measure im- 
provement in subjects who had not gone be- 
yond high school? Since a test of this kind 
is of great importance to organizations that 
want to measure the effectiveness of a super- 
visory training program, it is suggested that 
another study should be made that would 
parcel out such factors as age, seniority, in- 
telligence, and motivation so that we may 
know whether the readability of the instru- 
ment is a primary factor in causing differ- 
ences. Or possibly the suggestion made by 
Maloney (3) should be carried out and the 
test revised so that the problem disappears. 
Additional data should also be collected so 
that there is some way of evaluating, in ab- 
solute terms, a change due to a training pro- 
gram. 
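
One conventional way of putting such changes on a common footing, offered here only as an illustration and not as a procedure from the manual or from the present study, is to express the mean gain in units of the pretest standard deviation. The sketch below applies that convention to the means and standard deviations reported in Table 1. The function name `standardized_gain` is hypothetical.

```python
def standardized_gain(pre_mean, post_mean, pre_sd):
    """Mean gain expressed in units of the pretest standard deviation."""
    return (post_mean - pre_mean) / pre_sd


# (pretest mean, posttest mean, pretest SD) as reported in Table 1.
groups = {
    "Total group": (41.64, 50.61, 10.86),
    "Elementary": (40.53, 47.16, 10.41),
    "Secondary": (38.90, 47.80, 11.16),
    "College": (47.43, 57.40, 8.37),
}

gains = {name: standardized_gain(*values) for name, values in groups.items()}
for name, gain in gains.items():
    print(f"{name}: gain of {gain:.2f} pretest standard deviations")
```

An index of this kind does not settle what a given change means in practice, but it does allow gains from groups with different initial variability to be compared on one scale.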


Summary 


A group of supervisors were tested before 
and after a training program with alternate 
forms of How Supervise?. The group was 
subdivided by educational level. Although 
all groups improved significantly, the great- 
est gains were made by supervisors who had 
gone to college. Lower ranking subjects who
had had previous training showed more
improvement than the lower ranking subjects
who had not had previous training, although 
the mean scores of the two groups were the 
same. It was suggested that the instrument 
is useful for assessing the effectiveness of a 
supervisory training program but that more 
work must be done on the readability of the 
test and on the meaning of score changes fol- 
lowing a training program. 


Received December 19, 1955. 


References 


1. File, Q. W., & Remmers, H. H. How Supervise?
(Revised manual.) New York: Psychological
Corporation, 1948.

2. Karn, H. W. Performance on the File-Remmers
Test, How Supervise?, before and after a
course in psychology. J. appl. Psychol., 1949,
33, 534-539.

3. Maloney, P. W. Reading ease scores for File’s
How Supervise?. J. appl. Psychol., 1952, 36,
225-227.

4. Millard, K. A. Is How Supervise? an intelligence
test? J. appl. Psychol., 1952, 36, 221-224.

5. Sartain, A. Q. Relation between scores on
certain standard tests and supervisory success in
an aircraft factory. J. appl. Psychol., 1946,
30, 328-339.

6. Weitz, J., & Nuckols, R. C. A validation study
of How Supervise?. J. appl. Psychol., 1953,
37, 7-8.

7. Wickert, F. R. How Supervise? Scores before
and after courses in psychology. J. appl.
Psychol., 1952, 36, 388-392.

8. Wickert, F. R. Relation between How Supervise?,
intelligence and education for a group
of supervisory candidates in industry. J. appl.
Psychol., 1952, 36, 303.





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


An Item Analysis of How Supervise? Using Both Internal and 
External Criteria 


Robert L. Decker ¹


West Virginia University 


The present investigation was designed to 
determine the effectiveness of How Supervise?
as a measure of supervisory ability in an in- 
dustrial situation. The method involved com- 
puting product-moment correlations between 
total scores on How Supervise? and measures 
of success in supervisory positions. In addi- 
tion, an item analysis which included meas- 
ures of item difficulty, item validity, and in- 
ternal consistency was made to determine 
how the test was functioning with the group 
of subjects under study. 

A review of the literature concerning the 
application of psychological tests in industry 
indicates that How Supervise? is widely ac- 
cepted as a measure of supervisory knowledge 
(1, 2, 9, 10, 11, 13, 16, 17, 18, 21, 26, 27). 
There are many studies indicating changes in 
scores achieved on the test as a result of su- 
pervisory training programs, courses in indus- 
trial psychology, etc. (2, 10, 11, 16, 18, 26). 
How Supervise? has also been used in the 
study of other aspects of industrial behavior 
such as attitudes, interests, etc. (14, 15, 19, 
20, 22, 23, 24, 25, 26). However, the stud- 
ies reported have not yielded sufficient infor- 
mation to justify the use of How Supervise? 
as an aid in selection, placement, or promo- 
tion, three of the most important areas for 
practical application of tests in industry. 
There is a lack of data which would offer 
reasonable support for the belief that success- 
ful supervisory performance can be predicted 
from scores on How Supervise?. 

It is true that in constructing the test the
authors used a measure of validity (ratings
of the members of the standardization group
by their superiors) as well as measures of
internal consistency as bases for selecting items
to be used in the final forms (5, 6). However,
File reports that it was necessary to rely
mainly on correlation with total score as a
basis for selecting items because the ratings
of the members of the standardization group
proved to be unreliable and of questionable
validity (5, 6).


1 The author wishes to express his appreciation to
Dr. Richard S. Uhrbrock, Associate Director of
Industrial Relations of The Procter and Gamble
Company, for his kind suggestions and help in the
formulation of the problem and treatment of the data.


In an article published in 1946 File and 
Remmers reviewed some studies which sug- 
gest that there is a relationship between su- 
pervisory success and scores on How Super- 
vise? (4). In one study scores achieved by 
46 supervisors were compared with those of 
14 nonsupervisors who had been bypassed 
for promotion. The group of supervisors had
a significantly higher score. Similar results 
were obtained in a comparison of the scores 
of 54 supervisors rated as superior with those 
of 20 supervisors who were rated as inferior. 
Since the groups of subjects were small, File 
did not regard the findings as conclusive. 

A study reported by Sartain in 1946 found 
no significant relationship between supervi- 
sory performance and scores on How Super- 
vise? (21). Forty members of supervision in 
an aircraft factory took a battery of tests 
which included the Experimental Edition 
(Form A) of How Supervise?. The scores
on How Supervise? were compared with rat- 
ings of each of the supervisors by his immedi- 
ate superior, using correlation techniques, and 
no evidence of relationship was found.


Procedure 
Subjects 


The subjects of the present study were 208 mem- 
bers of the supervisory staff of a large manufactur- 
ing organization. All were male college graduates 
who were hired as members of supervision during 
the ten-year period prior to the study. All subjects 
were either at the first or second level of supervision,
i.e., their rank was the equivalent of either Foreman
or Supervisor, at the time of the study. Most of 
the men were hired directly after graduation from 
college and all were selected on similar standards. 
All subjects had participated in on-the-job supervi- 
sory training programs. 




Administration of the Test 


The test used was Form M of How Supervise?, 
the form recommended by its authors for use with 
office or higher level supervisors (3). The test is 
composed of 100 items dealing with problems, prac- 
tices, and opinions related to industrial supervision. 
The test is divided into three sections, namely, Su- 
pervisory Practices, Company Policies, and Super- 
visor Opinions. In the first section the subject re- 
sponds to each of a list of 20 statements concerning 
Supervisory Practices as desirable, undesirable or 
uncertain. In a second section he responds to 32 
statements of Company Policies as desirable, unde- 
sirable, or uncertain. In a third section composed 
of 48 statements of Supervisor Opinion the subject 
is asked to express agreement, disagreement, or un- 
certainty with each opinion. The score on the test 
by the standard scoring procedure is the total num- 
ber right minus the total number wrong. The an- 
swer “uncertain” is not scored. 
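
The standard scoring procedure just described can be sketched as follows. The four-item key and the response labels shown are invented for illustration and are not the published key; the function name `rights_minus_wrongs` is likewise hypothetical.

```python
def rights_minus_wrongs(responses, key):
    """Score a record as number right minus number wrong.

    A response of "uncertain" is not scored, so it neither adds to
    nor subtracts from the total.
    """
    rights = 0
    wrongs = 0
    for item, answer in responses.items():
        if answer == "uncertain":
            continue
        if answer == key[item]:
            rights += 1
        else:
            wrongs += 1
    return rights - wrongs


# Hypothetical key and one subject's answers (not the real test key).
example_key = {1: "desirable", 2: "undesirable", 3: "undesirable", 4: "desirable"}
example_record = {1: "desirable", 2: "desirable", 3: "uncertain", 4: "desirable"}

example_score = rights_minus_wrongs(example_record, example_key)
print(example_score)
```

Under this formula a subject who marks an item wrong is penalized relative to one who marks it "uncertain," which is the reason the item analysis reported below uses the total number right instead.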

The test was administered to the subjects in 
groups of from 3 to 15. The men were told that 
an evaluation study was being made of the test and 
that their cooperation was needed to “test the test.” 
They were also told that the results of the test 
would remain confidential. Participation in the 
study was voluntary. The instructions recom- 
mended by the authors of the test were followed 
in the administration of the forms (3). 


The External Criterion 


Each subject was rated on a rating scale of
supervisory performance which has been developed by
King and Wingert and which is published by
Industrial Psychology, Inc. (12). The rating scale
contains 60 questions or statements about the ratee’s
job efficiency. The statements cover such
performance areas as quantity, quality, job knowledge,
personal work habits, potential for further
development, etc. The rater responds to each question or
statement by checking “Yes or True” or “Not True
at Present.” The scale is so arranged that one-half
of the favorable responses require the rater to check
“Yes or True” while for the other half the
favorable response is “Not True at Present.” Each
subject was rated by his immediate superior. The
statements in the rating scale were weighted according
to their D values and phi values from the group on
which the test was standardized. For the purposes
of the present study the total weighted raw score on
the rating scale was used as the measure of success
as a supervisor, i.e., the external criterion.

As an additional check upon the acceptability of 
the rating scale as a measuring instrument a split- 
half reliability study was performed. Scores on the 
first half of the rating scale were compared with 
those for the second half. Application of Spearman- 
Brown techniques resulted in a corrected reliability 
coefficient of .898 (8). This was considered an ac- 
ceptable reliability by the present investigator. 
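
The Spearman-Brown step-up applied to a split-half correlation can be sketched as below. Only the corrected coefficient of .898 is reported; the uncorrected half-test correlation of about .815 shown here is back-calculated from that figure under the usual doubled-length formula and is an assumption, not a reported value. The function names are hypothetical.

```python
def spearman_brown(half_r, length_factor=2.0):
    """Estimated reliability of a test lengthened by `length_factor`."""
    return length_factor * half_r / (1.0 + (length_factor - 1.0) * half_r)


def implied_half_r(full_r):
    """Invert the doubled-length formula to recover the half-test correlation."""
    return full_r / (2.0 - full_r)


half_r = implied_half_r(0.898)          # back-calculated, not reported
full_r = spearman_brown(half_r)         # recovers the reported coefficient
print(f"half-test r = {half_r:.3f}, corrected reliability = {full_r:.3f}")
```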


The Statistical Analyses 


The data from the study were punched into IBM 
cards to facilitate analysis. The following statistical 
analyses were performed. 

1. A product-moment coefficient of correlation was 
computed between the total score by the standard 
scoring method, i.e., rights minus wrongs, on How
Supervise? (Form M) and supervisory performance 
as measured by the total raw score on the rating 
scale. 

2. D values (fraction of the group failing) were 
computed for each item.²

3. A biserial coefficient of correlation was com- 
puted between each item and the total number right 
on How Supervise?. The total number right was 
used instead of the rights-minus-wrongs scoring for- 
mula suggested by the authors of the test to avoid 
any variations which might result from the scoring 
formula. 

4. A biserial coefficient of correlation was com- 
puted between each item and the measure of super- 
visory success, i.e., the raw score on the rating scale. 
For both 3 and 4 above the items were scored as 
correct for the right answer or as incorrect for the 
wrong answer, a response of “uncertain,” or no
response. The general formula suggested by Guilford
was used in calculating the biserial rs (8). Biserial 
correlations were used instead of point biserials or 
phi coefficients because the purpose of the analysis 
was to determine the degree of correlation between 
the quality measured by the item and that measured 
by the total raw score on the test and by the rating 
of success (8). 

5. The test records for all subjects were rescored 
on the basis of the 25 items of the test which were 
found to correlate significantly with the criterion. 
A Pearson product-moment coefficient of correlation 
was then computed between the total number right 
on these 25 items and the criterion scores. 

6. A percentile table of scores on How Supervise? 
was constructed for the 208 subjects. 
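
Steps 2 through 4 above can be sketched as follows. The biserial formula used is the general textbook form, r = ((Mp - Mq) / s) (pq / y), where y is the ordinate of the unit normal curve at the point dividing the proportions p and q; it is assumed, not confirmed, that this matches the form taken from Guilford (8). The function names and the ten-subject data set are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def item_difficulty(item_correct):
    """D value: fraction of the group failing the item (0 = failed, 1 = passed)."""
    return item_correct.count(0) / len(item_correct)


def biserial_r(item_correct, scores):
    """Biserial correlation between a pass/fail item and a continuous score.

    Requires at least one passer and one failer; otherwise the normal
    ordinate is undefined and an error is raised.
    """
    p = item_correct.count(1) / len(item_correct)
    q = 1.0 - p
    passers = [s for c, s in zip(item_correct, scores) if c == 1]
    failers = [s for c, s in zip(item_correct, scores) if c == 0]
    unit_normal = NormalDist()
    ordinate = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(passers) - mean(failers)) / pstdev(scores) * (p * q / ordinate)


# Hypothetical data: one item (1 = right; 0 = wrong, "uncertain", or omitted)
# and the total number right for ten subjects.
item = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0]
totals = [90, 85, 78, 92, 80, 80, 84, 86, 83, 76]

print(f"D = {item_difficulty(item):.2f}, biserial r = {biserial_r(item, totals):.2f}")
```

Replacing `totals` with the criterion ratings gives the item validity of step 4 instead of the internal consistency of step 3. Because the biserial coefficient assumes a normally distributed trait underlying the item, it can exceed 1.00 when that assumption is badly violated.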


Results 


The scores on How Supervise? obtained by
the 208 subjects ranged from 19 to 100 and
averaged 79.61 with an SD of 10.04. The
raw scores on the rating scale for supervisory
performance ranged from 17 to 107 and
averaged 71.14 with an SD of 21.27. Both
distributions were tested for normality in terms
of skewness and kurtosis (7). Neither the
distribution of scores on How Supervise?
nor the distribution of scores on the rating
scale was significantly skewed. However,
both distributions showed marked tendencies
to be platykurtic. A visual inspection of the
scatter diagram gave no indication of the
presence of a curvilinear relationship in either
case.


2 This may be considered an inversion of the
frequently used procedure of reporting item difficulty
in terms of the percentage of the group answering
the item correctly.


The computation of a Pearson product- 
moment coefficient of correlation between the 
total score on How Supervise? and the total 
raw score on the scale for rating supervisory 
performance yielded an r of .108. On the
basis of this value of r with the present data
the rejection of the null hypothesis is not
justifiable. The obtained r is well below the
.181 which would be required for significance
at the .01 level of confidence (7).
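
As a rough cross-check on this comparison, the significance of a correlation can be approximated with Fisher's z transformation. The sketch below is an approximation introduced here for illustration; the critical value of .181 cited above was taken from published tables, and the normal approximation gives a slightly smaller figure of about .178 for N = 208. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, n):
    """Return (z statistic, two-tailed p) for the hypothesis of zero correlation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


def critical_r(n, alpha=0.01):
    """Smallest |r| reaching two-tailed significance at `alpha` (normal approximation)."""
    z_critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z_critical / math.sqrt(n - 3))


z_total, p_total = fisher_z_test(0.108, 208)   # total score vs. rating criterion
r_needed = critical_r(208)
print(f"z = {z_total:.2f}, p = {p_total:.3f}, critical r = {r_needed:.3f}")
```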

Item difficulties or D values as measured
by the fraction of the group failing the item
ranged from .01 to .64 with the median at
.10. The total number right on How Supervise?
ranged from 60 to 100 and averaged
85.65 with an SD of 7.2. The internal
consistency measures (biserial rs between
individual items and total number right on How
Supervise?) ranged from .00 to .84 with the
median at .40. Item validities as measured
by the biserial rs between the individual items
and the criterion measures ranged from -.26
to .70, with the median obtained coefficient
being .07.³


Table 1

Results of the Item Analysis of How Supervise? (Form M) *

          Item             Internal          Item
Item      Difficulty**     Consistency       Validity

 10          .01              .84              .58
 12          .11              .47              .27
 14          .41              .48              .27
 15          .05              .53              .21
 16          .01              .60              .30
 34          .39              .41              .22
 36      [illegible]          .35          [illegible]
 37          .09              .43             -.26
 54          .03              .67              .32
 55          .06              .57              .23
 56          .02          [illegible]          .30
 57      [illegible]          .80              .22
 58      [illegible]          .65              .31
 59      [illegible]      [illegible]          .70
 68      [illegible]          .40              .67
 80      [illegible]          .50              .25
 83      [illegible]          .58          [illegible]
 85      [illegible]          .71          [illegible]
 87      [illegible]          .46          [illegible]
 88      [illegible]          .51          [illegible]
 90      [illegible]          .50          [illegible]
 92      [illegible]      [illegible]      [illegible]
 96          .01              .60          [illegible]
 99      [illegible]          .48          [illegible]
100          .01              .21          [illegible]

* All coefficients are significant at the .01 level of confidence.
** Item difficulty is reported in terms of the fraction of the
group failing the item.


Table 2

Percentile Table for 208 Industrial Supervisors
on How Supervise? (Form M)

Test Score       Percentile

   100               99
    98               98
    97               97
    96               96
    95               95
    94               94
    93               92
    92               90
    91               88
    90               86
    89               84
    88           [illegible]
[illegible]          74
[illegible]          67
[illegible]      [illegible]
[illegible]          64
[illegible]          62
    82               58
    81               55
    80               50
    79               45
    78           [illegible]
    77           [illegible]
[remaining rows illegible]


Twenty-five items were found to have
validity coefficients significant at the .01 level
of confidence. The item numbers along with
their indices of difficulty, internal consistency,
and validity are presented in Table 1.
A test composed of these items would have a
median difficulty of .04, a median internal
consistency measure of .50, and a median
validity coefficient of .27. When the test
records were rescored for these 25 items and the
total number right compared with the
criterion score, the r was .35, which is well above
the .18 that would be required for
significance at the .01 level of confidence.


3 The complete results of the item analysis
showing measures of difficulty, internal consistency, and
validity have been filed with the American
Documentation Institute. Order Document No. 5044
from ADI Auxiliary Publications Project,
Photoduplication Service, Library of Congress,
Washington 25, D. C., remitting in advance $1.25 for
microfilm or $1.25 for photocopies. Make checks payable
to Chief, Photoduplication Service, Library of
Congress.


The percentile table for total score on How 
Supervise? based on the 208 subjects of this 
study is presented in Table 2. 


Discussion 


The results of the computation of r be- 
tween scores on How Supervise? and success 
in a supervisory position indicate no appar- 
ent relationship. Hence, the use of this test 
in selecting, promoting, or placing members 
of supervision under circumstances similar to 
those encountered in this study is definitely 
not warranted. A review of the validity co- 
efficients for the individual items lends fur- 
ther support to this conclusion. 

Although the correlations between the 
items which were found to be selective in 
terms of the criterion were generally low, a 
review of the statements contained in these 
items might suggest directions which future 
research should follow in dealing with the 
prediction of supervisory success. The one 
item which had a significant negative correla- 
tion was item 37. The supervisors who were 
rated as successful did not feel that “Requir- 
ing supervisors to submit in writing their rea- 
sons for firing or penalizing any employee” 
was a desirable practice. Being familiar with 
the policies and practices of the company 
whose employees served as subjects, the
author feels certain that the subjects interpreted
the statement as applying only to penalties
and not to firing since the termination of an
employee would not take place without
thorough investigation and documentation. The
successful supervisor perhaps felt that being
required to make a statement in writing
regarding any penalty, however slight, might
be a step toward depriving him of
responsibility and independence in the operation of
his department.

The statistical analyses of the data
indicated the following tendencies among the
supervisors who were rated less successful.
They did not feel that the following
supervisory practices were undesirable.⁴

1. “Using production records alone to
determine which worker to recommend for
promotion” (item 10).

2. “Prohibiting conversation between
workers on routine jobs” (14).

3. “Making an example of one worker to
prevent further trouble with others” (12).

4. “Putting a loud individual in his place
with a sarcastic remark” (15).

5. “Selecting supervisors according to how
much they know about the different jobs they
will supervise” (34).

6. “Fining employees for violations of
rules” (36).


4 The correct response to each of these items was
“Undesirable.” The less successful supervisors tended
to respond to the practice as “Desirable” or to be
“uncertain” about it.


In addition, they did not feel that
“Explaining to workers who submit nonusable
suggestions why their ideas can not be put to
use” was a desirable practice (16). Further,
the less successful supervisors tended to either
agree with or be uncertain about the
following statements.

1. “So-called mental fatigue is actually
nothing but laziness” (54).

2. “Most employees do better work if they
get a good bawling out every so often” (55).

3. “The only guarantee of good work is a
fat pay envelope” (56).

4. “Praising workers for good work only
leads to demands for more pay” (57).

5. “The average worker cares little about
what others think of his job so long as the
pay is good” (58).

6. “The worker’s opinion of his supervisor
is not very important” (59).

7. “The only important requirement of a
good supervisor is a complete understanding
of the jobs he is to supervise” (68).

8. “The nature of the supervisor’s job
makes it necessary for him to be unpopular
with his workers” (80).

9. “The best way to handle tough workers
is to be tougher than they are” (83).

10. “The average supervisor can do
nothing to reduce absenteeism” (85).

11. “Constant demands on the time of top
executives make it impractical for them to
spend any time in actual conversation with
workers” (87).

12. “Lectures are usually better than con- 
ferences for getting ideas across to workers” 
(88). 

13. “You can tell when a person is lying 
by noting whether or not he looks you in the 
eye” (90). 

14. “About half of the workers in our 
company are just naturally stubborn and un- 
cooperative” (92). 

15. “Supervisors should be completely re- 
lieved from duties concerning production plan- 
ning and materials handling” (96). 

16. “Rapid learners are usually quick for- 
getters” (99). 

17. “The goals of management and labor 
are directly opposed and must always be in 
conflict with each other” (100). 

The types of statements which are selective
would seem to indicate that future research
might well be concerned primarily with the
elimination of individuals whose prospects of
becoming successful supervisors are poor. The
profile of an individual who would respond to
the statements in the way just outlined
suggests that the temperament of an individual
may be a more important factor in his
performance as a supervisor than his
supervisory knowledge. The possibility that the
responses given by the unsuccessful supervisors
tend to characterize an authoritarian
personality structure should receive further
investigation.

If it is desirable to use the test in its pres- 
ent form with subjects similar to those of the 
present study, the answers might be scored 
on the basis of the 25 items listed above. 
The statistical analysis suggests that a total 
score based upon these 25 items has some 
value as a predictor of supervisory success. 
Further validation of this conclusion with 
other subjects would be desirable, however. 
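
A rescoring of this kind can be sketched as below. The item numbers are those listed in Table 1 and in the discussion above; the function name `rescore` and the sample record are hypothetical. The sketch counts keyed-right answers on all 25 items, including item 37, whose validity coefficient was negative; whether that item was reversed in the original rescoring is not stated, so this treatment is an assumption.

```python
# The 25 items reported as having significant validity coefficients.
VALID_ITEMS = (10, 12, 14, 15, 16, 34, 36, 37, 54, 55, 56, 57, 58, 59,
               68, 80, 83, 85, 87, 88, 90, 92, 96, 99, 100)


def rescore(items_answered_correctly):
    """Number right on the 25 selected items only; all other items are ignored."""
    return len(set(items_answered_correctly) & set(VALID_ITEMS))


# Hypothetical subject who answered these five items correctly.
sample_record = [10, 11, 12, 45, 100]
print(rescore(sample_record))
```

Since the same 208 records were used both to select the items and to compute the resulting correlation, a score built this way would need to be checked on a new group before being relied upon, as noted above.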

The results of the present study indicate 
that the items are not sufficiently difficult to 
make Form M of How Supervise? a satisfac- 
tory test to be used with subjects having the 
backgrounds of those in this investigation. 
The most difficult questions, items 45 and
46, were failed by 62% and 64% of the
group, respectively. Practically all the other
items were consistently passed by a large
majority of the group as indicated by the
average item difficulty of .14. All the subjects
of the present study were college graduates.
The question of whether or not How Supervise?
could be used in selection, placement,
upgrading, etc., of non-college graduates still
remains unanswered. Studies in this area
should include some control of intelligence
since reports made by Millard (14) or
Wickert (27) suggest that How Supervise? may
be a test of intelligence for individuals below
certain educational levels.

The biserial rs between total number right 
on How Supervise? and each item are mostly 
low but positive and, except for four items, 
significant at the .01 level of confidence. 
The general indication is that the items are 
consistently measuring some quality. It seems 
conceivable, in the light of the face validity 
of the items, that this quality could be su- 
pervisory knowledge. If this were true then 
the results of the present study might be 
taken as an indication that supervisory knowl- 
edge of the range and level measured by How 
Supervise? is not a factor in achieving suc- 
cess as a supervisor. 

The percentile table in Table 2 is pre- 
sented as a standard of performance of col- 
lege graduates engaged in supervisory posi- 
tions on How Supervise?. It is interesting to 
note that the median of this group falls at a 
score of 80 while the median score for “top 
management supervisors” in the standardiza- 
tion group used by the authors of the test is 
72 (3). 


Summary and Conclusions 


Two hundred and eight college graduates 
who were members of the supervisory staff 
of a large manufacturing organization took 
Form M of How Supervise? and were rated 
for supervisory performance on the rating 
scale devised and published by King and 
Wingert (12). Statistical analysis indicated 
no relation between scores on How Supervise? 
and rated success in a supervisory position. 
An item analysis indicated that the items con- 
sistently measured some quality, possibly su- 
pervisory knowledge. The items in the test 
were found to be too easy for the group of
subjects and for the most part not valid
predictors of supervisory success as measured
under the conditions of the present study. 
Test records for the subjects were rescored 
on the basis of the 25 items which had sig- 
nificant coefficients of validity. The r be- 
tween total number right on these items and 
the criterion was found to be .35. 

The following conclusions were drawn: 

1. Under the conditions of the present 
study, scores on How Supervise? do not pre- 
dict success in a supervisory position. 

2. The items in Form M of How Super- 
vise? are not sufficiently difficult for college 
graduates having the backgrounds of the sub- 
jects of the present study. 

3. Form M of How Supervise? is consist- 
ently measuring some quality, possibly super- 
visory knowledge. 


Received January 23, 1956. 


References 


1. Belman, H. S., & Evans, R. N. Selection of
students for a trade and industrial education
curriculum. J. educ. Psychol., 1951, 42, 52-58.

2. Canter, R. R., Jr. A human relations training
program. J. appl. Psychol., 1951, 35, 421-425.

3. File, Q. W., & Remmers, H. H. Manual for
How Supervise?. (Rev. Ed.) New York:
Psychological Corporation, 1948.

4. File, Q. W., & Remmers, H. H. Studies in
supervisory evaluation. J. appl. Psychol., 1946,
30, 421-425.

5. File, Q. W. The measurement of supervisory
quality in industry. Unpublished doctor’s
dissertation, Purdue Univer., 1944.

6. File, Q. W. The measurement of supervisory
quality in industry. J. appl. Psychol., 1945,
29, 323-337.

7. Garrett, H. E. Statistics in psychology and
education. New York: Longmans, Green,
1948. Pp. 220, 299, 347-352.

8. Guilford, J. P. Fundamental statistics in
psychology and education. New York: McGraw-Hill,
1950. Pp. 209, 324, 492, 499.

9. Jurgensen, C. E. Foreman training based on
the test How Supervise?. Personnel J., 1949,
28, 123-127.

10. Karn, H. W. Performance on the File-Remmers
test, How Supervise?. J. appl. Psychol., 1949,
33, 534-539.

11. Katzell, R. A. Testing a training program in
human relations. Personnel Psychol., 1948,
1-2, 319-329.

12. King, J. E., & Wingert, Judith W. Merit rating
series—performance-supervisor. Chicago:
Industrial Psychology, Inc., 1953.

13. Millard, K. A. A personnel study of
supervisors in business and industry. Unpublished
doctor’s dissertation, Univer. of Minnesota,
1947.

14. Millard, K. A. Is How Supervise? an intelligence
test? J. appl. Psychol., 1952, 36, 221-224.

15. Miller, F., & Remmers, H. H. Studies in
industrial empathy: II. Management’s attitude
towards industrial supervisors and their
estimates of labor’s attitude. Personnel Psychol.,
1950, 3, 33-40.

16. Mosel, J. N., & Tsacnaris, H. J. Evaluating the
supervisory training program. J. Personn.
Adm. industr. Relat., 1954, 1, 99-104.

17. Mosier, C. I. Review of How Supervise?. In
O. K. Buros (Ed.), The third mental
measurements year-book. New Brunswick: Rutgers
Univer. Press, 1949. Pp. 727-728.

18. Pond, Bette B. Performance on File-Remmers
How Supervise? test before and after
supervisory training. Unpublished master’s thesis,
Pennsylvania State College, 1951.

19. Remmers, H. H., Remmers, L., & Miller, F.
A quantitative study of reciprocal empathy
of labor leaders and industrial management.
Amer. Psychologist, 1949, 4, 282-283.
(Abstract)

20. Remmers, Lois J., & Remmers, H. H. Studies in
industrial empathy: I. Labor leaders’ attitudes
toward industrial supervision and their
estimates of management’s attitudes. Personnel
Psychol., 1949, 2, 427-436.

21. Sartain, A. Q. Relation between scores on
certain standard tests and supervisory success in
an aircraft factory. J. appl. Psychol., 1946,
30, 328-339.

22. Slocombe, C. S. Appraisal of Mr. File’s study.
Personnel J., 1946, 24, 251-254.

23. Speroff, B. J. Relationship between empathic
ability and supervisory knowledge. J. Personn.
Adm. industr. Relat., 1954, 1, 195-197.

24. Van Zelst, R. H. Empathy test scores of union
leaders. J. appl. Psychol., 1952, 36, 293-295.

25. Whyte, W. H., Jr. The fallacies of personality
testing. Fortune, 1954, 50, No. 3 (Sept.).

26. Wickert, F. R. How Supervise? scores before
and after a course in psychology. J. appl.
Psychol., 1952, 36, 388-392.

27. Wickert, F. R. Relationship between How
Supervise?, intelligence and education for a
group of supervisory candidates in industry.
J. appl. Psychol., 1952, 36, 301-303.





Journal of Applied Psychology 
Vol. 40, No. 6, 1956 


Preference Measurement by the Methods of Successive Intervals 
and Monetary Estimates 


Purnell H. Benson 


Drew University 
and John H. Platten, Jr. 
J. A. Ward, Inc. 


The methods of successive intervals and 
paired comparisons have been proposed to 
measure product preferences in applying the 
marginal preference model to consumer be- 
havior (1). These methods satisfy the rule 
of addition that the sum of preference differ- 
ences AB and BC equals AC for any three 
points along a linear qualitative continuum 
(2). 

This rule may not be a sufficient condition 
for the method of measurement used to es- 
tablish the relationship of equal marginal 
preferences in applying the model. For large 
ranges in preference the magnitude of the 
error in expressing preferences between quali- 
tative points may change. The error in choos- 
ing between a $101 article and a $102 ar- 
ticle is apparently greater than the error in 
choosing between a $1 and a $2 article. In 
using measurement methods based upon a
changing judgmental error, the relationship 
of equal marginal preferences will not hold 
accurately unless the range of preference 
variation included in the measurements is 
relatively limited. 

Furthermore, it may be noted that, in 
using the methods of successive intervals or 
paired comparisons, the rule of addition is 
satisfied by the measurement of preferences 
for single articles, but not by the measure- 
ment of preferences for combinations of ar- 
ticles presented simultaneously as stimuli for 
choice. If the preference difference between 
a $101 article and a combination of a $101 
article and a $1 article is measured, the con- 
sumer can separate the combination in for- 
mulating his judgments. The judgmental 
error is apparently not then the same as if 
a $101 and a $102 article are compared. 
Methods of preference measurement based 
upon this error as a unit satisfy the rule of
addition for unilinear series, but not for
multilinear series in which there are alterna- 
tive routes of judgment in proceeding from 
one point to another in the hierarchy of 
preferences. 

A method of measuring preference in which 
the metric unit does not change for objects 
of different monetary value is provided by 
asking the individual to estimate how much 
he would be willing to pay to secure the ob- 
ject of his preference. Such a method of 
measurement possesses little novelty, but 
seems useful in supplying the corrective fac- 
tor for adjusting the judgmental error unit 
used in the methods of successive intervals 
and paired comparisons. 

This paper examines the relationship be- 
tween preferences as measured by the method 
of successive intervals and preferences as 
measured by monetary estimate. If the re- 
lationship is linear, this means that the judg- 
mental error unit remains constant relative 
to changes in monetary value. If the rela- 
tionship is nonlinear, the required correction 
factor can be defined for changes in the 
measuring unit when preferences for cheap 
and expensive articles are compared. 


Description of the Data 


A questionnaire containing two parts was 
administered to 102 individuals of middle 
socioeconomic background who were inter- 
viewed outside of neighborhood supermar- 
kets. The first portion of the questionnaire 
read to respondents was: 


I have in my hand a list of brand-new articles of 
equal retail value. Imagine that you are at some 
big affair where these articles are being given away 
free as door prizes. They are not to be re-sold by 
the people who win them. As I read each one to 
you, tell me how much you would like to win it. 










The ten articles on the list are:


A $50 Rug
A $50 Radio
A $50 TV set
A $50 Camera
A $50 Bicycle
A $50 Portable typewriter
A $50 Set of china
A $50 Easy chair
A $50 Vacuum cleaner
A $50 Dress


The preference categories to be checked are: 


Like extremely 
Like very much 
Like somewhat 
Like very little 


In the second portion of the questionnaire, 
the instruction to respondents was: 


Now, let’s imagine you are at an auction sale, and 
these same brand-new articles are being auctioned 
off to the highest bidder. For each one of them tell 
me what is the most money you would be likely to 
bid for it. Once again remember that none of them 
could be re-sold. You would have to be getting 
them for yourself or your family only. 


The monetary values selected and the cor- 
responding categories checked are tabulated 
in Table 1. The ten articles and the 102 re- 
spondents provide 1020 observations, six of 
which were removed because of “don’t know” 
replies. Class intervals were selected to provide
an even distribution of monetary values
and a complete matrix for computation of
scale values by the method of successive intervals.


[Figure 1. Vertical axis: successive interval scale value. Horizontal axis: monetary estimate in dollars.]

Fig. 1. Relationship of preference-scale values obtained
by the method of successive intervals and
preference-scale values given by monetary estimates.


An exponential function of the type
\(Y = aZ^{b} + c\) was used to fit the relationship
between scale values and monetary estimates.
The constants for this were found by calculating
the coefficient of correlation for various
values of b and deriving the maximum. The
equation obtained is


\(Y = .428\,Z^{.5}\), [1]


where Z is the monetary estimate and Y is 
the preference measured by the method of 


Table 1


Scale Values Obtained by the Method of Successive Intervals Compared with Monetary Estimates


Monetary Estimate            Category of Preference
Class     Class       Like        Like        Like       Like          Scale
Mean      Interval    Extremely   Very Much   Somewhat   Very Little   Value*


Class Mean    Class Interval
$0            $0
$4.52         $1-5
$9.85         $6-10
$14.58        $11-15
$19.49        $16-20
$24.86        $21-25
$29.79        $26-30
$40.60        $31-50


25 
37 


28 30 343 
13 74 
25 49 1.34 
19 1 1.70 
6 2.23 
10 2.13 
3 2.48 
9 2.45 


09 
54 





* Origin for scale values was selected to eliminate one of the constants in the functional relationship fitted to scale values and monetary estimates.







successive intervals. The origin for the scale 
values of Y was chosen after analysis to elimi- 
nate the c constant. The curve and the 
points upon which it is based are given in 
Fig. 1. 
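
The fitting procedure described above (try various values of b, keep the one that gives the highest correlation, then solve for the linear constants) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the grid of candidate exponents, and the data are hypothetical, and the data are generated from a known curve rather than taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_power_curve(z, y, b_grid):
    """Choose b to maximize r between Z**b and Y, then solve a and c by least squares."""
    best_b = max(b_grid, key=lambda b: pearson_r([v ** b for v in z], y))
    x = [v ** best_b for v in z]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    c = mean_y - a * mean_x
    return a, best_b, c, pearson_r(x, y)


# Hypothetical data generated from Y = 0.4 * Z**0.5 + 0.1 (not the study's data).
z_demo = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
y_demo = [0.4 * v ** 0.5 + 0.1 for v in z_demo]
b_grid = [k / 20 for k in range(2, 31)]  # candidate exponents 0.10, 0.15, ..., 1.50

a_hat, b_hat, c_hat, r_max = fit_power_curve(z_demo, y_demo, b_grid)
print(a_hat, b_hat, c_hat, r_max)
```

Because the correlation coefficient is unchanged by linear transformations, the search over b can be carried out before a and c are known, which is presumably why the correlation was used as the criterion.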


Discussion and Conclusion 


The principle that consumers make pur- 
chases at those points on their buying con- 
tinuum where marginal preferences are equal 
assumes that preferences measured for arti- 
cles of different costs are based upon the 
same sized measuring unit. When the method 
of successive categories or other method based 
upon a judgmental error unit is applied over 
a range in preference variation which is not 
small in extent, a correction is needed. This 
correction can be supplied by means of the 


formula

\(Y' = Y^{1/b}\), [2]


in which case the relationship between Y’, 
the corrected preference measurement, and Z, 
the monetary value, becomes linear. 
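
Why a power transformation of Y makes the relationship linear can be shown in one step. The sketch below assumes the fitted function has the form \(Y = aZ^{b}\), with the constant c already eliminated by the choice of origin, and takes the correction to be the inverse power \(1/b\).

```latex
\begin{align*}
Y  &= aZ^{b} \\
Y' &= Y^{1/b} = \left(aZ^{b}\right)^{1/b} = a^{1/b}\,Z
\end{align*}
```

Under these assumptions Y' is directly proportional to Z, so equal differences in Y' correspond to equal differences in monetary value.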

The value found for b and the type of function
utilized here are to be regarded tentatively
until more complete studies have been
made. The value for the linear constant a 
apparently depends upon the amount of in- 
come which the consumer has for disposal. 
Research is needed to establish the values 
for a appropriate to different income levels 
and demands upon family resources. 

Alternatively, the monetary estimate may
be used as a method of measuring preference
in the application of the marginal preference 
model to consumer behavior. Since mone- 
tary amounts are uniformly additive, no cor- 
rection is apparently needed from this stand- 
point. Estimates made by the consumer 
usually require qualification if they are taken 
as indicative of actual buying behavior. 

It remains to be disclosed whether mone- 
tary estimates provide as precise preference 
measurements as the values obtained by the 
methods of successive categories or paired 
comparisons. In general, those questionnaire 
data which require the respondent to make 
complex rather than simple judgments are 
less precise, and for this reason data from 
choices rather than money estimates may 
provide the more precise measurements, with 
such studies as the one reported here giving 
the corrective adjustment. The size of the 
correlation coefficient found here, .968, sug- 
gests that comparable results can be obtained 
by either method for measuring consumer 
preferences. 


Received April 25, 1956. 


References 


1. Benson, P. H. A model for the analysis of con- 
sumer preference and an exploratory test. J. 
appl. Psychol., 1955, 39, 375-381. 

2. Gulliksen, H. O. Paired comparisons and the 
logic of measurement. Psychol. Rev., 1946, 
53, 199-213. 







The Relationship Between Chi Square and Size of Sample: 
the General Case 


Herbert D. Kimmel 


Human Factors Research, Inc., Los Angeles 


In an earlier paper (3) a procedure was sug- 
gested for achieving some control over the 
number of independent observations which 
might be required to obtain statistical signifi- 
cance in studies using a chi-square test in two- 
celled tables. The suggestion was based on 
the fact that, in such cases, the conventional 
expression for computing chi square reduces to 
\(4(f_o - f_e)^2/N\). By substituting the value of chi
square required at the desired level of significance,
the relationship between N and \(f_o - f_e\)
was defined and could be described graphically.
Thus the significance of obtained differences
could be determined quickly by visual inspection
of the graph. It was shown that the use of
this graph would enable experimenters to plan
data collection in phases of 10, 20, or 30 Ss at
a time. After each period of data collection,
it could be determined from the graph whether
the difference between the observed frequencies
and those expected by chance was great
enough to exceed the desired level of significance.
In this way the size of N could be
kept reasonably close to the minimum required 
by the actual difference in the population rather 
than fixed arbitrarily on a priori grounds. 
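
The two-celled special case can be illustrated with a short sketch. The function names are hypothetical, equal expected frequencies of N/2 are assumed, and the critical value 3.841 is the standard tabled chi-square value for 1 df at the .05 level, which the text itself does not quote.

```python
import math


def chi_square_two_cells(f_first, n_obs):
    """Chi square for a two-celled table with expected frequency N/2 in each cell."""
    expected = n_obs / 2
    return 4 * (f_first - expected) ** 2 / n_obs


def min_difference_for_significance(n_obs, chi_sq_critical):
    """Smallest |f_o - f_e| at which 4(f_o - f_e)^2 / N reaches the critical value."""
    return math.sqrt(chi_sq_critical * n_obs) / 2


CHI_SQ_05_DF1 = 3.841  # standard tabled value, assumed here for illustration

# Hypothetical example: 28 of 40 subjects choose the first alternative.
chi_demo = chi_square_two_cells(28, 40)
needed_diff = min_difference_for_significance(40, CHI_SQ_05_DF1)
print(chi_demo, needed_diff)
```

Plotting `min_difference_for_significance` against N would give a curve of the kind the earlier paper describes, from which an obtained difference can be judged at each phase of data collection.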

The present paper extends the logic under- 
lying this procedure to the general chi-square 
situation, regardless of the number of cells in 
the table. It should be noted, in making the 
transition from the special to the general case, 
that the fortuitous disappearance of the sum- 
mation operator in the special case does not 
occur in the general case. 

The conventional expression,

\[ \chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right], \qquad [1] \]

has been shown by Cramér (1, p. 417) to be
identical with

\[ \chi^2 = \frac{n \sum f_o^2 - N^2}{N}, \qquad [2] \]

in which N = number of observations and n =
number of cells. Solving for \(\sum f_o^2\), this becomes

\[ \sum f_o^2 = \frac{N\chi^2 + N^2}{n}. \qquad [3] \]


Table 1
\(\sum f_o^2\) Required for Significance at .05 and .01 Levels for Several Ns and ns*
(.05 level in the first row for each n; .01 level in the second row)

Number of                       Number of Observations (N)
Cells (n)  Level      30       40       50       60       70       80       90      100

    3       .05    359.91   613.21   933.18  1319.82  1773.12  2293.09  2879.73  3533.03
            .01    392.10   656.13   986.83  1384.20  1848.23  2378.93  2976.30  3640.30
    4       .05    282.61   478.15   722.69  1017.22  1361.76  1756.30  2200.84  2695.38
            .01    310.06   513.41   766.76  1070.12  1423.47  1826.82  2280.17  2783.50
    5       .05    236.93   395.90   594.88   833.86  1112.83  1431.81  1718.78  2189.76
            .01    259.66   426.22   632.75   879.32  1165.88  1492.43  1858.99  2265.54
    6       .05    205.35   340.47   508.92   710.70   945.82  1214.27  1516.05  1851.17
            .01    225.43   367.24   542.35   750.86   992.67  1267.81  1576.29  1918.10
    7       .05      —      300.52   447.08   622.22   825.92  1058.19  1319.04  1608.45
            .01      —      324.64   477.23   658.39   868.12  1106.42  1373.30  1668.74
    8       .05      —      270.34   400.42   555.50   735.59   940.67  1170.75  1425.84
            .01      —      292.38   427.97   588.56   774.16   984.75  1220.34  1480.94
    9       .05      —        —      363.93   503.38   665.05   848.95  1055.07  1283.41
            .01      —        —      389.33   533.93   700.70   889.69  1100.90  1335.44
   10       .05      —        —      334.60   461.51   608.43   975.35   962.27  1169.19
            .01      —        —      358.33   490.00   641.66   813.23  1004.99  1216.66

* No values have been given for situations in which the theoretical cell frequencies are less than 5. In these situations it is
recommended that a direct method of computing P be used. It should also be noted that the values given in this table have not
been corrected for discontinuity in the case of small theoretical frequencies. According to Guilford, the correction should be
applied in cases involving theoretical frequencies of 25 or less but only in two-celled tables (2, p. 279).


For any particular situation, once an accept- 
able level of significance has been decided upon, 
the left-hand expression in Equation 3 can be 
specified for any value of N. A table or graph
may be made before data collection begins,
describing the relationship between \(\sum f_o^2\) and
sample size (N) at the chosen level of significance.
Then it is only necessary to square the
observed frequencies and compare the sum of 
these squares with the value required for sig- 
nificance for that size of sample. The data 
collection could be terminated when this value 
was attained. 
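
The identity on which this procedure rests can be checked numerically. The sketch below assumes equal theoretical frequencies of N/n in every cell, as in the chance hypothesis considered here; the function names and the observed counts are hypothetical.

```python
def chi_square_conventional(observed):
    """Sum of (f_o - f_e)^2 / f_e with equal expected frequencies N/n."""
    n_obs = sum(observed)
    expected = n_obs / len(observed)
    return sum((f - expected) ** 2 / expected for f in observed)


def chi_square_from_squares(observed):
    """Equivalent form for equal expected frequencies: (n * sum(f_o^2) - N^2) / N."""
    n_obs = sum(observed)
    n_cells = len(observed)
    return (n_cells * sum(f * f for f in observed) - n_obs ** 2) / n_obs


# Hypothetical observed frequencies for a four-celled table, N = 40.
demo_counts = [18, 9, 7, 6]
chi_a = chi_square_conventional(demo_counts)
chi_b = chi_square_from_squares(demo_counts)
print(chi_a, chi_b)
```

Both forms give the same value for these counts, which is why the sum of squared observed frequencies alone is enough to judge significance once N and n are fixed.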

For example, suppose a preference experiment
required subjects to choose the most preferred
of four stimuli. A chance hypothesis
would predict N/4 preferences for each stimulus.
Assuming that the .01 level of significance
were required for rejection of this chance hypothesis,
\(\chi^2\) with 3 df would have to equal or
exceed 11.341. Substituting these values in
Equation 3,

\[ \sum f_o^2 = \frac{11.341N + N^2}{4}. \]

It can be seen that the minimum necessary
value of \(\sum f_o^2\) to obtain significance can be
specified beforehand for any N. Table 1 illustrates
this relationship for the .05 and .01 levels
of significance in situations involving anywhere
from 3- to 10-celled tables.


1 Notation changed; Cramér's \(p_i\) is constant.
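
The four-stimulus example can be worked through in code. This is a minimal sketch: the function names and the observed counts are hypothetical, and only the critical value 11.341 and the form of Equation 3 come from the text.

```python
def required_sum_of_squares(n_obs, n_cells, chi_sq_critical):
    """Minimum sum of squared observed frequencies needed for significance (Equation 3)."""
    return (n_obs * chi_sq_critical + n_obs ** 2) / n_cells


def is_significant(observed, chi_sq_critical):
    """Compare the sum of squared observed frequencies with the required value."""
    needed = required_sum_of_squares(sum(observed), len(observed), chi_sq_critical)
    return sum(f * f for f in observed) >= needed


CHI_SQ_01_DF3 = 11.341  # critical value quoted in the text for 3 df at the .01 level

# Required values for a four-celled table at N = 30, 40, and 50.
thresholds = {n: required_sum_of_squares(n, 4, CHI_SQ_01_DF3) for n in (30, 40, 50)}
print(thresholds)

# Hypothetical counts after two phases of data collection, N = 40 in both cases.
early_stop = is_significant([20, 8, 6, 6], CHI_SQ_01_DF3)   # sum of squares = 536
keep_going = is_significant([18, 9, 7, 6], CHI_SQ_01_DF3)   # sum of squares = 490
print(early_stop, keep_going)
```

The computed thresholds round to the Table 1 entries for four cells at the .01 level (310.06, 513.41, and 766.76), so a table of this kind can be regenerated for any N from the critical value alone.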


Received April 14, 1956.


References 


1. Cramér, H. Mathematical methods of statistics. 
Princeton: Princeton Univer. Press, 1951. 

2. Guilford, J. P. Fundamental statistics in psychology 
and education. New York: McGraw-Hill, 1950. 

3. Kimmel, H. D. The relationship between chi square
and size of sample in two-celled tables. J. appl. 
Psychol., 1956, 40, 61-62. 

























